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MtTHODS AND COMPOSITIONS UTILIZING A BRANCH AND TERMINATE ALGORITHM FOR 

PROTEIN DESiGN 

This application is a continuing application ol U.S.3.N. 60/151,818, filed September 1, 1999. 

FIELD OF THE INVENTION 

5 The present invention relates co an apparatus and method tor quantitative protein design and 

optimization. In particular, the invention describes the use ol the Branch and Terminate algorithm 
in protein design. 

BACKGROUND Or THE INVENTION 

De novo protein design has received considerable attention recently, and significant advances 
10 have been made toward the goal of producing stable, well-folaed proteins with novel sequences. 
Efforts to design proteins rely on knowledge of the physical properties that determine protein 
structure, such as the patterns of hydrophobic and hydrophiiic residues in the sequence, salt 
bridges and hydrogen bonds, and secondary structural preferences of amino acids. Various 
approaches to apply these principles have been attempted. For example, the construction of 
1 5 cc-helical and 3-sheet proteins with native-like sequences was attempted by individually selecting 
the residue required at every position in the target fold (Hecht, et a/., Science 249:884-891 (1990); 
Quinn, etaL Proc. Natl. Acad Sri u&a fli-R747-E7*i (1994)). Alternatively, a minimalist 
approach was used to design helical proteins, where the simplest possible sequence believed to be 
consistent with the folded structure was generated (Regan, et a/., Science 241:976-978 (1988); 
20 DeGrado, ef a/., Science 243:622-628 (1989); Handei, et a/., Science 261:879-885 (1993)), with 
varying degrees of success. An experimental method that relies on the hydrophobic and polar 
(HP) pattern of a sequence was developed where a library of sequences with the correct pattern 
for a four heiix bundle was generated by random mutagenesis (Kamtekar, ef a/., Science 262:1680- 
1685 (1993)). Among non de novo approaches, domains of naturally occurring proteins have been 
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modified or coupled together to achieve a desired tertiary organization (Pessi, ef a/., Nature 
362:367-369 (1993); Pomerantz, ef a/., Science 267:93-96 (1995)). 

Though the correct secondary structure and overall tertiary organization seem to have been 
attained by several of the above techniques, many designed proteins appear to lack the structural 
specificity of native proteins. The complementary geometric arrangement of amino acids in the 
folded protein is the root of this specificity and is encoded in the sequence. 

Several groups have applied and experimentally tested systematic, quantitative methods to protein 
dPSion with the goal of developing general design algorithms (Hellinga. et a/.. J. Mol. Biol. 222: 763- 
785 (199D Hurley, ef a/.. J Mol. Biol. 224:1143-1154 (1992); Desjarlais, et a/.. Protein Science 
4 2006-2018 (1995); Harbury. efa/.,Pror Natl Acad. Sci. USA 92:8408-8412 (1995); Klemba, ef 
a, Nat S truc. Biol. 2:368-373 (1995); Nautiyal, ef a/., Biochemistry 34:11645-1 1651 (1995); Betzo, 
ef a/.. Biochemistry 35:6955-6962 (1996); Dahiyat, ef a/.. Protein Science 5:895-903 (1996); Jones. 
Protein Science 3:567-574 (1994); Konoi, ef a/., ProtPins Structure, Function and Genetics 19:244- 
255 (1994)). These algorithms consider the spatial positioning and steric complementarity of side 
chains by explicitly modeling the atoms of sequences under consideration. To date, such 
techniques have typically focused on designing the cores of proteins and have scored sequences 
with van der Waals and sometimes hydrophobic solvation potentials. 

Recent studies using coiled coils have demonstrated that core side-chain packing can be 
combined with explicit backbone flexibility (Harbury et al., PNAS USA 92:8408-8412 (1995); Offer 
& Sessions, J. Mol. Biol. 249:967-987 (1995). In these cases, the goal was to search for backbone 
coordinates that satisfied a fixed amino acid sequence. 

in addition, the qualitative nature of many design approaches has hampered the development of 
improved, second generation, proteins because there are no objective methods for learning from 
past design successes and failures. 

25 Thus, it is an object of the invention to provide computational protein design and optimization via an 
objective, quantitative design technique implemented in connection with a general purpose 
computer. 

SUMMARY OF THE INVENTION 

In accordance with the objects outlined above, the present invention provides methods executed 
30 by a computer under the control of a program, the computer including a memory for storing the 
program. The methods comprise the steps of receiving a protein backbone structure with variable 
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residue positions, establishing a grour! cf potential rctsmers for each of the variable residue 
positions, wherein at least one variable residue position has rotamers ^om at least twe different 
amino acid side chains, and analyzing the interaction of each of the rotamers with all or part of the 
remainder of the protsvn backbone structure to generate a set o f ontimhred protein sequences. The 
5 methods f^her comprise classifying each V3r:ab»e r es!due position as either a core, surface or 
boundary residue. The analyzing step may include a Branch and Terminate (S&T) computation 
either alone or in combination with a Dead- End Elimination (DEE) computation. Generally, the 
analyzing step includes the use of at least one scoring function selected from the group consisting 
of a Van der Waals potential scoring function, a hydrogen bond potential scoring function, an 

10 atomic solvation scoring function, a secondary structure propensity scorns 'unction and an 

electrostatic scoring function. The methods further comprise altering the protein backbone prior to 
the analysis, comprising altering at least one supersocondary structure parameter yalue. The 
methods may further comprise generating a rank ordered !ist of additional optima! sequences from 
the globally optima! protein sequence. Some or all of the protein sequences from the ordered list 

15 may be tested tc produce potential energy test results. 

In an additional aspect, the invention provides nucleic acid sequences encoding a protein 
sequence generated by ths present methods, and expression vectors and host cells containing the 
nucleic acids 

In a further aspect, the indention provides a computer readable memory to direct 3 computer to 
20 function in a specified manner, comprising a side chain module to correlate a group of potential 
rotamers for residue positions of a protein backbone model, and a ranking module to analyze the 
interaction of each of said rotamers with a!! or part of the remainder of said protein to generate a 
set of optimized protein sequences The memory may further comprise an assessment module to 
assess the correspondence between potential energy test results and theoretical potential energy 
25 data. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 illustrates a general purpose computer configured in accordance with an embodiment of 
the invention. 

Figure 2 illustrates processing steps associated with an embodiment of the invention. 

30 Figure 3 illustrates processing steps associated with a ranking module used in accordance with an 
embodiment of the invention. After any DEE step, any one of the previous DEE steps may be 
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repeated. In addition, any one of the DEE steps may be eliminated; for example, original singles 
DEE (step 74) need not be run. 

Figure 4 is a schematic representation of the minimum and maximum quantities (defined in Eq. 24 
to 27) that are used to construct speed enhancements. The minima and maxima are utilized 
5 directly to find the (iJJ^ pair and for the comparison of extrema. The differences between the 
quantities, denoted with arrows, are used to construct the <7^ and q w metrics. 

Figures 5A, 5B, 5C and 5D depict several super-secondary structure parameters for a/p proteins. 
The definitions are similar to those previously developed for a/p proteins (Janin & Chothia, J Mol 
Biol 143:95-128 (1980); Cohen et al., J Mol Biol 156:821-862 (1982)). The helix center is defined 

10 as the average C a position of the residues in the helix. The helix axis is defined as the principal 
moment of the C a atoms of the residues in the helix. (Chothia et al., Proc Natl Acad Sci USA 
78:4146-4150 (1981); J Mol Biol 145:215-250 (1981). The strand axis is defined as the average 
of the least-squares lines fit through the midpoints of sequential C a positions of two central P- 
strands. The sheet plane is defined as the least-squares plane fit through the C a positions of the 

15 residues of the sheet. The sheet axis is defined as the vector perpendicular to the sheet plane that 
passes through the helix center. Q is the angle between the strand axis and the helix axis after 
projection onto the sheet plane; G is the angle between the helix axis and the sheet plane; h is the 
distance between the helix center and the sheet plane; o is the rotation angle about the helix axis. 
The super-secondary structure parameter values for native GP1 are Q = -26.49°, 6 = 3.20°, h = 

20 10.04 A and o= 0°. 

Figures 6A, 6B, 6C and 6D depict four supersecondary structure parameters for p/p protein 
interactions. Figures 6A and 6B are relevant to p barrel proteins; Figure 6C is relevant to p-sheet 
interactions. Figure 6A shows cniy three strands, and depicts R, the barrel radius; a, the tilt of the 
strands relative to the barrel axis; a, the distance from C a to C a along the strands; and b, the 

25 interstrand distance. Figure 6B shows the twist and coiling angles of the P-sheet, with residues A, 
B and C from one strand, residues D, E and F in strand 2, and residues G, H and I from strand 3. 
The circles represent the positions of the residues when projected onto the surface of the barrel. In 
this case, 8 is the mean twist of the P-sheet about an axis perpendicular to the strand direction, t is 
the mean twist of the p-sheet about an axis parallel to the strand direction, c is the mean coiling of 

30 the p-sheet along the strands, n »s the mean coiling of the p-sheet along a line perpendicular to the 
strands. Figure 6C depicts two 3-sheets, with the chain direction being shown with arrows. Figure 
6D depicts two P-sheets of distance h with angle 9 between the average strand vectors. There is 
also 4), perpendicular to vectors defining 9. 
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Figures 7A, 7B, 7C and 7D depict four supersecondary structure parameters a/a supersecondary 
structure parameters for a/a interactions, d is the distance between the helices and 6 is the angle 
between the axes of the helices, o is defined as the rotation around the helix axis. CI is the angle 
between two strand axes after projection onto a plane. In Figures 7C and 7D, the dark circle 
5 represents a view of the helix from the end. 

Figure fl depkis the total optimization time vs. valus of sorting method for the (a) mixed structural 
type and (b) p- sheet surface benchmark cases. Sorting is determined by the value of the factor fin 
Eq. 5. The ssses exhibit different dependencies on tha value of the sorting factor, but both have 
minima in the vicinity of f ~ 0.08. This trend is observed for all cases (not shown). 

10 Figure S depicts the benchmark times of B&T versus ether combinatorial search algorithms. 

Figure 10 dsplcts the optimization tirr.es resulting from the combination of B&T {hashed bars) and 
DEE (solid bars) algorithms. The bars on the extreme left and right of the figure are the times for 
lone B&T and DEE optimization, respectively. The remaining bars are the cumulative B&T and 
DEE optimization times when the two algorithms are used in succession. The sudden jumps in 
15 DEE times arise from lengthy Goldstein doubles calculations. 

DETAILED DESCRIPTION OF THE INVENTION 

The present invention is directed to the quantitative design and optimization of amino acid 
sequences, using an "inverse protein folding" approach, which seeks the optimal sequence for a 
desired structure. Inverse folding is similar to procein design, which seeks to find a sequence or set 
20 of sequences that will fo!d into a desired structure These approaches can be contrasted with a 
"protein folding" approach which attempts to predict a structure taken by a given sequence. 

The general preferred approach of the present invention is as follows, although alternate 
embodiments are discussed below. A known protein structure is used as the starting point. The 
residues to be optimized are then identified, which may be the entire sequence or subset(s) 

25 thereof. The side chains of any positions to be varied are then removed. The resulting structure 
consisting of the protein backbone and the remaining sidechains is called the template. Each 
variable residue position is then preferably classified as a core residue, a surface residue, or a 
boundary residue; each classification defines a subset of possible amino acid residues for the 
position (for example, core residues generally will be selected from the set of hydrophobic 

30 residues, surface residues generally will be selected from the hydrophilic residues, and boundary 
residues may be either). Each amino acid can be represented by a discrete set of all allowed 
cenfermers of each side chain, called rotarners Thus, to arrive at an optimal sequence for a 

5 
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backbone, all possible sequences of rotamers must be screened, where each backbone position 
can be occupied either by each amino acid in all its possible rotameric states, or a subset of amino 
acids, and thus a subset cf rotams.s. 

Two sets of interactions are then calculated for each rotamer at every position: the interaction of 
5 the rotamer side chain with all or part of the backbone (the "singles" energy, also called the 

rotamer/ternplate or rotamer/backbone energy), and the interact.cn of the rotamer side chain with 
all ether possible rotamers at every other position or a subset of the other positions (the "doubles- 
energy, also called the rotamer/rotamer energy). The energy of each of these interactions is 
calculated through the use of a variety of scoring functions, which include the energy of van der 
1 o Waal's forces, the energy of hydrogen bonding, the energy of secondary structure propensity, the 
energy of surface area solvation and the electrostatics Thus, the total energy of each rotamer 
interaction, both with the backbone and other rotamers. is calculated, and stored in a matrix form. 
Thus, a sample matrix is generated for the singles calculation, and for the doubles calculations. 

The discrete nature of rotamer sets allows a simple calculation of the number of rotamer 
1 5 sequences to be tested. A backbone of length n with m possible rotamers per position will have m" 
possible rotamer sequences, a number which grows exponentially with sequence length and 
renders the calculations either unwieldy or impossible in real time. Accordingly, to solve this 
combinatorial search problem, either a "Dead End Elimination' (DEE) calculation or a "Branch and 
Terminate" (B&T) calculation, or both, are performed. The DEE calculation is based on the fact 

20 that if the worst total interaction of a first rotamer is still better than the best total interaction of a 

second rotamer. then the second rotamer cannot be part of the global optimum solution. Since the 
energies of all rotamers have already been calculated, the DEE approach only requires sums over 
the sequence length to test and eliminate rotamers. which speeds up the calculations considerably. 
DEF. can be rerun comparing pairs of rotamers, or combinations of rotamers, which will eventually 

25 result in the determination of a single sequence which represents the global optimum energy. 
Alternatively, a 3&T calculation can be done, as is more fully described below. 

Once the global solution has been fCLnd. a Monte Carlo search may be done to generate a rank- 
ordered list of sequences in the neighborhood of the DEE or B&T solution. Starting at the DEE or 
B&T solution, random positions are changed to other rotamers, and the new sequence energy is 
30 calculated. If the new sequence meets the criteria for acceptance, it is used as a starting point for 
another jump. After a predetermined numbei of jumps, a rank-ordered list of sequences is 
generated. 

B&T nay also be used to generate a rank ordered list of sequences in the neighborhood of the 
DEE or B&T solution. In fact, this search may be perfprmed without prior knowledge of the DEE or 
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B&T solution. The results mav; then be experiments'!*/ verif ed by physically g~ner?t ; rg one or 
more of the ptote'r; sequences fcilow-ricl by experimental testing The information obtained from the 
testing can then be fed back into the ^a'ysis. to mocVfy the procedure if necessary. 



Thus, the present mventior provides a cGir:pi;te>ass i sterf method of designing a protein. The 
5 method courses provjjiMjj si protein backbone structure with variable -esidue positions, and then 
establishing a group of potential foiamers for each cf the residue positions. As used herein, the 
backbone, oi template, includes the backbone atoms and any fixed side chains. The interactions 
between th« protein backbone a;vi the potential rotamsrs, and between pairs of ihe potential 
rotamers, are then processed to generate -3 set of optimized protein sequences, preferably a single 
10 global optimum, which then may be osed to generate 'ither related sequence? 

Figure 1 illustrates an automated protein design apparatus 20 in accordance with an embodiment 
of the invention. The apparatus 20 includes a central processing unit 22 which communicates with 

a memory 24 and a seX of inpul'ouipui devices (e g., keyboard, mouse, monitor, printer, etc.) 26 
through a bus 25. The genera! interaction between a central processing unit 22, a memory 24, 
15 input/output devices 26, and y bus 28 i?- known in the art. The present invention <s directed toward 
the automated protein design program 30 stored in the memory 24. 

The automated protein design program 30 may be implemented with a side chain module 32. As 
discussed in detail below, the side chain module establishes a group of potential rotamers for a 
selected protein backbone structure. The protein design program 30 may also be implemented 

20 with a ranking module 34. As discussed in detail below, the ranking module 34 analyzes the 
interaction of rotamers with the protein backbone structure to generate optimized protein 
sequences. The protein design program 30 may also include a search module 36 to execute a 
search, for example a Monte Carlo search as described below, in relation to the optimized protein 
sequences. Finally, an assessment module 38 may also be used to assess physical parameters 

25 associated with the derived proteins, as discussed further below. 

The memory 24 also stores a protein backbone structure 40, which is downloaded by a user 
through the input/output devices 26. The memory 24 also stores information on potential rotamers 
derived by the side chain module 32. In addition, the memory 24 stores protein sequences 44 
generated by the ranking module 34. The protein sequences 44 may be passed as output to the 
30 input/output devices 26. 



The operation of the automated protein design apparatus 20 is more fuiiy appreciated witn 
reference to Fig. 2. Fig. 2 illustrates processing steps executed in accordance with the method of 
the invention. As described below, many of the processing steps are executed by the protein 
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™ The first processing step illustrated in Fig. 2 is to provide a p-otein backbone 
design program 30. The first process.ng p downloa ded through 

structure (step 50). As previously .nd.cated, the prote.n backbone 
the input/output devices 26 using standard techniques. 

The protein backbone structure corresponds to a se.ected protein. By "protein" herein is meant at 
5 ZZe^o acids .inked together by a peptide bond. As used herein, protein includes protem, 
ZZ*e* -d peptides. The peptidy. group may comprise natura„y occurring am.no acds and 
« vletic peptidomimetic structures, i.e. "ana.ogs", such as peptoids (see Simon 
PNAS USA BS<20):9367 (1*2,,.. The amino acids may either be natura,, occunng or non- 

■ k„ thr.QP in the art any structure for which a set of 
naturallv occuring; as will be appreciated by those in tne an, any 

,0 ,ZZ is Known or can be oenerated « be used as an am,no acid The side chains ma, " 
le, the <R, or the (S) configuration .» a preferred embodiment, the amino acds ana ,n the ,S, or 

(L) configuration. 

The chosen protein ma, be an, pro,e,n ,o, which a three dimensional structure ,s Known or can be 
generated, that is. for which .here are three dimensional coordinates f or each e,om o ^ 
15 Leral,, «. can be de,e,m,ned -g X-ra, olographic ^e~ * 

„ 0 v„ modelling, homolog, modelling, etc. In genera,. I, X-ra, structures are used, structures a. 
resolution or Better are preferred, but not required. 

The proteins n,a, Pe from any organism, including promotes and eu k an-o.es. w» , enz,mes from 
pactena. fungi, extremeophiles sucn as the archebacteria. insects, f,sh. animals ( pa rt ,cu,ari, 
20 mammals and particularly human) and birds all possible 

Suitabie proteins include, bu, are no. limited to. industrial and pharmaceutic* proteins, inching 
ugands. L surface receptors, antigens. anybodies. c„oK,nes. hormones, and enzymes Su«ab e 
Ises o, enzymes .nclude. bu. are no. limited ,0. hydrolases such as proteases. carbohydrases. 
Upases isomerases such as racemases. epimerases. tautomerases, or mutases; transferases. 
25 Ises, o*ido,educ,ases. and phophatases. Suable enzymes are listed in the Swiss-Pro. enzyme 



database. 



Suitable protein backbones include, but are not limited to. a., of those found in the protein data 
base compiled and serviced by the Brookhaven National Lab. 

Specific^ included within "protein" are fragments and domains of known proteins, inc.uding 
30 functions domains such as enzymatic domains, binding domains, etc., and smaller fragments, 
such as turns, .cops. etc. That is, portions of proteins may be used as we... In addition, protein 
variants, i.e. non-naturally occuring variants, may be used. 
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Once the protein is chosen, the protein backbone structure is input ^nto a*.e- computer. By "protein 
backbone structure" or grammatical equivalents herein is meant tro three dimensional coordinates 
that define t ie th;ee ciirnansionai strurt,?r= of a particular protein. The structures which comprise a 
protein bzcktc.-Af structure (cf a naturaiiy orxurlng protein) are the nitrogen, the carbonyl carbon, 
5 the ct-csrbon, and the carbonyl oxygen, along with the direction of trie vector vrom the a-carbon to 
the p-carbon. 

The proteir backbone structure which is input into the computer can sithe * ir.ciude the coordinates 
for both the backbone and the amine acid side chains, cr just the bacKbcne, i.e. with the 
coordinates for the ai7;;i>o acid side chains removed. !f the former is done, the side chain atoms of 
10 each amino acid of the protein structure may be "stripped" or removed from the structure of a 

protein, as is known in the art, leaving enly the coordinates for the "backbone" atoms (the nitrogen, 
carbonyl carbon and oxygen, and the a-carbon, and the hydrogens attached to the nitrogen and a- 
carbon). 



15 



In a preferred embodiment, the protein backbone structure is altered prior to the analysis outlined 
below, in this embodiment, the representation of the starting protein backbone structure is reduced 
to a description of the spatial arrangement of its secondary structural elements. The relative 
positions of the secondary structural elements are defined by a set cf parameters called 
supersecondary structure parameters. These parameters are assigned values that can be 
systematically or randomly varied to alter the arrangement of the secondary structure elements to 
20 introduce explicit backbone flexibility. The atomic coordinates of the backbone are then changed to 
reflect the altered supersecondary structural parameters, and these new coordinates are input into 
the system for use in the subsequent protein design automation. 

Basically, a protein is first parsed into a collection of secondary structural elements which are then 
abstracted into geometrical objects. For example, as more fully outlined below, an a-helix is 

25 represented by its helical axis and geometric center. The relative orientation and distance between 
these objects are summarized as super-secondary structure parameters. Concerted backbone 
motion can be introduced by simply modulating a protein's super-secondary structure parameter 
values. Accordingly, when all or part of the Dackbone is to be altered, the portion to be altered is 
classified as belonging to a particular supersecondary structure element, i.e. a/3, a/a or (J/p, and 

30 then the supersecondary structural elements as outlined below are altered. As will be appreciated 
by those in the art, these elements need not be covaiently linked, i.e. part of the same protein; for 
example, this can be done to evaluate protein-protein interactions. 

As will be appreciated by those in the art, it is possible to alter the backoone of certain positions, 
white retaining either a particular amino acid (which can be "floated", as outlined below) or a 
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particular rotamer at the position; alternatively, both the backbone can be moved and the amino 
acid side chain can be optimized as outlined herein. Similarly, the backbone can be held constant 
and only the am>no acid side chains are optimized. Combinations of any of these at any position 
may be done. In general, when supersecondary structural parameters are altered, this is done on 
5 more than one amino acid. i.e. the backbone atoms of a plurality of amino acids that contribute to 
the secondary structure are moved. 

As will be appreciated by those in the art, there are a wide variety of different supersecondary 
structure parameters that can be used. Super-secondary structure parameterization has been 
described for fold classes that include a/a (Crick FHC The Fourier transform of a coiled-coil. Acfa 
10 Crystallogr £685-689 (1953a); Crick FHC. The packing of a-helices. Acta Crystallogr 6.689-697 
(1953b); Chothia et al., Proc Natl Acnd Sci USA 78:4146-4150 (1981) "Relative orientation of 
close-packed ^pleated sheets in proteins" and Chothia et al.. J Mol Biol M5:215-250 (1981) "Helix 
to helix packing in proteins"; Chou, et al. Energetics of the structure of the four-a-helix bundle in 
proteins. Proc Natl Acad Sci USA 85:4295-4299 (1988); Murzin AG. Finkelstein AV. General 
1 5 architecture of the a-hefea! globule. J Mo! Bio! 204:749-769 (1 938). Presnell SR. Cohen FE. 
Topological distribution of four-a-he!ix bundles. Proc Nail Acad Sci USA 86:6592-6596 (1989); 
Harris et al. Four helix bundle diversity in globulsr proteins. J Mol Biol 235.1356-1368 (1994) . a/p 
(Chothia et al.. Structure of proteins: packing of a-helices and pleated sheets. Prcc Natl Acad Sci 
USA 74:4130-4134 (1977); Janin & Chothia, 1980 Packing of a-helices onto ^-pleated sheets and 
20 the anatomy of aip proteins. J Mol Biol ?43:95~128; Cohen et al.. ^82. Analysis and prediction of 
the packing of a-helices against a P sheet in the tertiary structure of globular proteins. J Mol Biol 
1 56:821-862; Chou et a!., 1985. Interactions between an a-helix and £sheet energetics of aip 
packing in proteins. J Mol Biol 1 36:591-609, and B/(J (Cohen et al., Analysis and prediction of 
protein /*-sheet structures by a combinatorial approach. Nature 285:378-382 (1980); Cohen et al.. 
25 Analysis of the tertian/ structure of pratein £sheet sandwiches. J Mol Biol 148:253-272 (1981); 
Chothia & Janin, Relative orientation cf close-packed ^-pleated sheets in proteins. Proc Natl Acad 
Sci USA 78:4146-4150 (1981); Chothia & Janin, Proc Natl Acad Sci USA 78:3955-3965 (1982) 
Orthogonal packing of ^-pleated sheets in proteins; Chou et al., J Mol Biol 788:641-649 (1986) 
"Interactions between two p-sheets energetics of OP packing in proteins"; Laster et al.. Proc Natl 
30 Acad Sci USA 85:3338-3342 (1 988)"Structure principles of parallel ^-barrels in proteins"; Murzin et 
al.. J Mol Biol 236:1369-1381 (19943), "Principles determining the structure of /3-sheet barrels. I. A 
theoretical analysis"; Murzin et al. J Mol Biol 236: 1382---! 400 (1994b) "Principles determining the 
structure of /3-sheet barrels. II. The observed structures"; all of these references are explicitly 
incorporated by reference herein in their entireity). 

35 Four different supersecondary structure parameters useful for a/p proteins are shown in Figure 5. 
In a preferred embodiment, as *or all the supersecondary structure parameters, at least one of 
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these parameter values is altered; other embodiments utilize simultaneous or sequential alteration 
of two, three or four of these parameter values. 

For the a/p protein interactions, the heiix center is defined as the average C a position of the 
residues chosen for backbone movement. The helix axis is defined as the principal moment of the 
5 C a atoms of these residues (see Chothia et al. t 1981, supra). The strand axis is defined as the 
average of the least-squares lines fit through the midpoints of sequential C a positions of the two 
central P-strands. The sheet plane is defined as the least-squares plane fit through the C a 
positions of the two central P-strands. The sheet axis is defined as the vector perpendicular to the 
sheet plane that passes through the helix center. O is the angle between the strand axis and the 

10 helix axis after projection onto the sheet plane; 6 is the angle between the helix axis and the sheet 
plane; h is the distance between the helix center and the sheet plane; o is the rotation angle about 
the helix axis. Backbone alteration requires altering at least one of these parameter values. In a 
preferred embodiment, the supersecondary structure parameter value Q is altered by changing the 
angle degree (either positively or negatively) of up to about 25 degrees, with changes of + 1 *, 2.5*. 

15 5*, 7.5*, and 10* being particularly preferred. In a preferred embodiment, the supersecondary 

structure parameter value 6 is altered by changing the angle degree (either positively or negatively) 
of up to about 25 degrees, with changes of + 1*. 2.5*. 5*. 7.5*, and 10* being particularly preferred, 
the supersecondary structure parameter value o is altered by changing the angle degree (either 
positively or negatively) of up to about 25 degrees, with changes of + 1*. 2.5*. 5*. 7.5*, and 10* 

20 being particularly preferred. In a preferred embodiment, the supersecondary structure parameter 
value h is altered by changes (either positive or negative) of up to about 8 A, with changes of + 
0.25, 0.50, 0.75, 1.00, 1.25 and 1.5 being particularly preferred. However, as will be appreciated by 
those in the art, as for all the parameter values outlined herein, larger changes can be made, 
depending on the protein (i.e. how close or far other secondary structure elements are) and 

25 whether other parameter values are made; for example, larger changes in Q can be made if the 
helix is also moved away from the sheet (i.e. h is increased). 

Four different supersecondary structure parameters useful for a/a proteins are shown in Figure 7. 
As for a/p parameters, the helix center is defined as the average C a position of the residues in the 
helix. The helix axis is defined as the principal moment of the C 3 atoms of the residues in the helix. 
30 o is defined as the rotation around the helix axis. Q is the angle between two strand axes after 
projection onto a plane. Thus, d, the distance between the helices, can be altered, generally as 
outlined above for h. Similarly, 9, o and Q can be altered as above. 

There are a number of different supersecondary structure parameters useful for p/p proteins, p- 
barrel configurations contain a number of different parameters that can be altered, as shown in 
35 Figure 6. These include: (see Figure 6A) R. the barrel radius: or the angle of tilt of the strands 
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relative to the barrel axis; and b, the interstrand distance; (see Figure 6B) 0, the mean twist of the 
P-sheet about an axis perpendicular to the strand direction; t, the mean twist of the 3-sheet about 
an axis parallel to the strand direction; e the mean coiling of the P-sheet along the strands; n. the 
mean coiling of the p-sheet along a line perpendicular to the strands; and (Figure 6C) Q is angle 
5 between the two P-sheet axes. As for the a/p parameter values, each of these may be altered, 
either positively or negatively. Generally, changes are made in at least one of these parameter 
values, by changing the angle degree (either positively or negatively) of up to about 25 degrees, 
with changes of + 1*. 2.5*. 5", 7.5*. and 10* being particularly preferred, b can be changed up to + 
1 A. For p sandwich structures (Figure 6C and 6D), Q can be altered up to + 45*. with changes of 
10 + r, 2.5*, 5*. 7.5*, and 10" being particularly preferred. Similarly, h can be altered as outlined 
above for a/p proteins, and 6 and cj> can be altered up to + 30". 

Once the desired value changes are selected, the coordinate positions for the positions chosen are 
altered to reflect the change, to form a "new" or "altered" backbone protein structure, i.e. one that 
has all or part of the backbone atoms in a different position relative to the starting structure. It 
15 should be noted that this process can be repeated, i.e. additional backbone changes can be made, 
on the same or different residues. In addition, after optimization, the backbone of one or more 
optimal sequences can altered and an optimization can be run. 

Alternatively, movement of the backbone can be done manually, i.e. sections of the protein 
backbone can be randomly or arbitrarily moved. In this embodiment, the backbone atoms of one or 
20 more amino acids can be moved some distance, generally an angstrom or more, in any direction. 
This can be done using standard modeling programs; for example, Molecular Dynamics 
minimization, Monte Carlo dynamics, or random backbone coordinate/angle-motion. It is also 
possible to move the backbone atoms of single residues, that are either components of secondary 
structural elements or not. 

25 Once a protein structure backbone is generated (with alterations, as outlined above) and input into 
the computer, explicit hydrogens are added if not included within the structure (for example, if the 
structure was generated by X-ray crystallography, hydrogens must be added). After hydrogen 
addition, energy minimization of the structure is run, to relax the hydrogens as well as the other 
atoms, bond angles and bond lengths. In a preferred embodiment, this is done by doing a number 

30 of steps of conjugate gradient minimization (Mayo et ai, J. Phvs. Chem. 94:8897 (1990)) of atomic 
coordinate positions to minimize the Dreiding force field with no electrostatics. Generally from 
about 10 to about 250 steps is preferred, with about 50 being most preferred. 

The protein backbone structure contains at least one variable residue position. As is known in the 
art, the residues, or amino acids, of proteins are generally sequentially numbered starting with the 

12 



BNSDOCID <WO 011686?A2 I > 



WO 03/16862 PCT/US00/40805 

N-terminus of the protein Thus a protein having a methionine at it's N-terminus is said to have a 
methionine at residue or amino acid position 1, with the next residues as 2 3, 4. etc At each 
position, the wild type (i.e. naturally oocuring) protein may have one of at least ?0 amino acids, in 
any number of rotamers By "variable residue oosirion" herein is meant an amino acid position of 
5 the protein to be designed that is not fixed in the design method as a specific residue or rotamer, 
generally the wild-type residue or rotamer. 

In a preferred embodiment, all of the residue positions of the protein are variable That is, every 
amino acid side chain may be pltered in the methods of the present invention This is particularly 
desirable for smaller proteins, although the present methods allow the design of larger proteins as 
10 well. While there is no theoretical limit to the length of the protein which may be designed this way, 
there is a practical computational limit 

!n an alternate preferred embodiment, only some of the residue positions of the protein are 
variable, and the remainder are "fixed", that is. they are identified in the three dimensional structure 
as being in a set conformation. In some embodiments, a fixed position is left in its original 

1 5 conformation (which may or may not correlate to a specific rotamer of the rotamer library being 
used) Alternatively, residues may be fixed as a non-wild type residue; for example, when known 
site-directed mutagenesis techniques have shown that a particular residue is desirable (for 
example, to eliminate a proteolytic site or alter the substrate specificity of an enzyme), the residue 
may be fixed as a particular amino acid. Alternatively, the methods of the present invention may be 

20 used to evaluate mutations de novo ; as is discussed beiow. in an alternate preferred embodiment, 
a fixed position may be "floated"; the amino acid at that position is fixed but different rotamers of 
that amino acid are tested. In this embodiment, the variable residues may be at least one, or 
anywhere from 0.1% to 99.9% of the tota! number of residues. Thus, for example, it may be 
possible to change only a few (or one) residues, or most of the residues, with all possibilities in 

25 between. 

In a preferred embodiment, residues which can be fixed include, but are not limited to, structurally 
or biologically functional residues. For example, residues which are known to be important for 
biological activity, such as the residues which form the active site of an enzyme, the substrate 
binding site of an enzyme, the binding site for a binding partner (ligand/recsptcr, antigen/antibody, 
30 etc.), phosphorylation or glycosylate sites which are crucial to biological function, or structurally 
important residues, such as disulfide bridges, meta! binding sites, critical hyd r ogen bending 
residues, residues critical for backbone conformation such as proline or glycine, residues critical for 
packing interactions, etc. may all be fixed in a conformation or as 3 single r otamar, cr "floated". 
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Similarly, residues which may be chosen as variable residues may be those tnat confer 
undesirable biological attributes, such as susceptibility to proteolytic degradation, dimerization or 
aggregation sites, glycosylation sites which may lead to immune responses, unwanted binding 
activity, unwanted allostery, undesirable enzyme activity but with a preservation of binding, etc. 

5 As will be appreciated by those in the art, the methods of the present invention allow computational 
testing of "site-directed mutagenesis" targets without actually making the mutants, or prior to 
making the mutants. That is, quick analysis of sequences in which a small number of residues are 
changed can be done to evaluate whether a proposed change is desirable. In addition, this may be 
done on a known protein, or on an protein optimized as described herein. 

10 As will be appreciated by those in the art, a domain of a larger protein may essentially be treated 
as a small independent protein; that is, a structural or functional domain of a large protein may 
have minimal interactions with the remainder of the protein and may essentially be treated as if it 
were autonomous. In this embodiment, all or part of the residues of the domain may be variable. 

It should be noted that even if a position is chosen as a variable posiiion, it is possible that the 
1 5 methods of the invention will optimize the sequence in such a way as to select the wild type residue 
at the variable position. This generally occurs more frequently for core residues, and less regularly 
for surface residues. In addition, it is possible to fix residues as non-wild type amino acids as well. 

Once the protein backbone structure has been selected and input, and the variable residue 
positions chosen, a group of potential rotamers for each of the variable residue positions is 

20 established. This operation is shown as step 52 in Figure 2. This step may be implemented using 
the side chain module 32. In one embodiment of the invention, tne side chain module 32 includes 
at least one rotamer library, as described below, and program code that correlates the selected 
protein backbone structure with corresponding information in the rotamer library. Alternatively, the 
side chain module 32 may be omitted and the potential rotamers 42 for the selected protein 

25 backbone structure may be downloaded through the input/output devices 26. 

As is known in the art, each amino acid side chain has a set of possible conformers, called 
rotamers. See Ponder, et a/., Acad. Press Inc. (London) Ltd. pp. 775-791 (1987); Dunbrack, era/., 
Struc. Biol. 1(51:334-340 (1994V Desmet, et a/., Nature 356:539-542 (1992), all of which are 
hereby expressly incorporated by reference in their entireity. Thus, a set of discrete rotamers for 
30 every amino acid side chain is used. There are two general types of rotamer libraries: backbone 
dependent and backbone independent. A backbone dependent rotamer library allows different 
rotamers depending on the position of the residue in the backbone; thus for example, certain 
leucine rotamers are allowed if the position is within an a helix, and different leucine rotamers are 
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aiiowea if the position is not in a a-helix. A backbone independent rotamer library utilizes all 
rotamers of an ammo acid at every position. In general, a backbone independent library is 
preferred in the consideration of core residues, since flexibility in the core is important However, 
backbone independent libraries are computationally more expensive, and thus for surface and 
boundary posuions, a backbone dependent iibrary is preferred. However, either type of library can 
be usea at any position. 

In addition, a preferred embodiment does a type ot 'fine tuning" of the rotamer library by expanding 
the possible x <chi) angie values of the rotamers by plus and minus one standard deviation (or 
more) aboui the mean vaiue, in order to minimize possible errors that might arise from the 
discreteness of the library. This is particularly important for aromatic residues, and fairly important 
for hydrophobic residues, due to the increased requirements for flexibility in the core and the rigidity 
of aromatic rings; it is not as important for the otner residues. Thus a preferred embodiment 
expands the x, and x 2 angles for all amino acids except Met, Arg and Lys. 

To roughly illustrate the numbers of rotamers, in one version of the Dunbrack & Karplus backbone- 
dependent rotamer library, alanine has 1 rotamer, giycine has 1 rotamer, arginine has 55 rotamers, 
threonine has 9 rotamers, lysine has 57 rotamers, glutamic acid has 69 rotamers, asparagine has 
54 rotamers, aspartic acid has 27 rotamers, tryptophan has 54 rotamers, tyrosine has 36 rotamers, 
cysteine has 9 rotamers, glutamine has 69 rotamers, histidine has 54 rotamers, valine has 9 
rotamers, isoleucine has 45 rotamers, leucine has 36 rotamers, metmonme has 21 rotamers, 
serine has 9 rotamers, and phenylalanine has 36 rotamers. 

In general, proline is not generally used, since it will rarely be chosen for any 'position, although it 
can be included if desired. Similarly, a preferred embodiment omits cysteine as a consideration, 
only to avoid potential disulfide problems, although it can be included if desired. 

As will be appreciated by those in the art, other rotamer libraries with all dihedral angles staggered 
can be used or generated. 

In a preferred embodiment, at a minimum, at least one variable position has rotamers from at least 
two different amino acid side chains; that is, a sequence is being optimized, rather than a structure. 

In a preferred embodiment, rotamers from all of the amino acids (or all of them except cysteine, 
glycine and proline) are used for each variable residue position; that is, the group or set of potential 
rotamers at each variable position is every possible rotamer of each amino acid. This is especially 
preferred when the number of variable positions is not high as this type of analysis can be 
computationally expensive. 
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In a preferred embodiment, each variable oosition is classified as either a core, surface or 
boundary residue position, although in some cases, as explained below, the variable position may 
be set to glycine to minimize backbone strain. 

It should be understood that quantitative protein design or optimization studies prior to the present 
5 invention focused almost exclusively on core residues. The present invention, however, provides 
methods for designing proteins containing core, surface and boundary positions. Alternate 
embodiments utilize methods for designing proteins containing core and surface residues, core and 
boundary residues, and surface and boundary residues, as well as core residues alone (using the 
scoring functions of the present invention), surface residues alone, or boundary residues alone. 

10 The classification of residue positions as core, surface or boundary may be done in several ways, 
as will be appreciated by those in the art. In a preferred embodiment, the classification is done via 
a visual scan of the original protein backbone structure, including the side chains, and assigning a 
classification based on a subjective evaluation of one skilled in the art of protein modelling. 
Alternatively, a preferred embodiment utilizes an assessment of the orientation of the Ca-Cp 
1 5 vectors relative to a solvent accessible surface computed using only the template Co atoms. In a 
preferred embodiment, the solvent accessible surface for only the Ca atoms of the target fold is 
generated using the Connolly algorithm with a add-on radius ranging from about 4 to about 12A, 
with from about 6 to about 10A being preferred, and 8 A being particularly preferred. The Co 
radius used ranges from about 1.6A to about 2.3A, with from about 1.8 to about 2.1A being 
20 preferred, and 1 .95 A being especially preferred. A residue is classified as a core position if a) the 
distance for its Ca. along its Ca-Cp vector, to the solvent accessible surface is greater than about 
4-6 A, with greater than about 5.0 A being especially preferred, and b) the distance for its CP to the 
nearest surface point is greater than about 1 .5-3 A. with greater than about 2.0 A being especially 
preferred. The remaining residues are classified as surface positions if the sum of the distances 
25 from their Ca, along their Ca-Cp vector , to the solvent accessible surface, plus the distance from 
their Cp to the closest surface point was less than about 2.5-4 A. with less than about 2.7 A being 
especially preferred. All remaining residues are classified as boundary positions. 

Once each variable position is classified as either core, surface or boundary, a set of amino acid 
side chains, and thus a set of rotamers, is assigned to each position. That is, the set of possible 

30 amino acid side chains that the program will allow to be considered at any particular position is 
chosen. Subsequently, once the possible amino acid side chains are chosen, the set of rotamers 
that will be evaluated at a particular position can be determined. Thus, a core residue will generally 
be selected from the group of hydrophobic residues consisting of alanine, valine, isoleucine, 
leucine, phenylalanine, tyrosine, tryptophan, and methionine (in some embodiments, when the a 

35 scaling factor of the van der Waals scoring function, described below, is low, methionine is 
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removed from the sei), and the rotamer sf-t ?or each core position, potentially mercies roiamers for 
these eight smirvo acid .side chains the iciarners if a backbone independent lir^ary is used, and 
subsets if h rc-tamer dependem backbone is used). Similarly, surfs; e positions srs gen i rally 
selected from the group of hydrophilic residues consisting of alanioo, serine, threonine, aspartic 
5 acid, aspa/aglne, giuiamine, glutamic acid, arg^.ine. iy-^r.e and histictine. The rotarner set for each 
surface position ihus includes. ^terrors fos these ten resid jes Finally, boundary positions are 
generally chosen from alanine, serine, threonine, aspsrtic acid, aspartame, glutamine, glutamic 
acid, arginine, lysine histidine, valine, Isolfjucine, leucine, phenylalanine, tyrosine, tryptophan, and 
methionine. The roomer set for esch boundary position thus potentially includes every rotamer for 
10 these seventeen residues (assuming cysteine, glycine and proline are not used, although they can 
be). 

Thijs, as wiil be appreciated by these* In the an, there is a computational benefit to classifying the 
residue positions, as it decreases the number of calculations It should also be noted that there 
may be situations where the sets of core, boundary and surface residues are altered from those 

15 described above; for example, under some circumstances, one or more amino acids is either 
added or subtracted from the set of allowed amino acids, For example, some proteins which 
dimerize or muitimerize, or have ligand binding sites, may contain hydrophobic surface residues, 
etc. In addition, residues that do not allow helix "capping" o»* the favorable interaction with an a- 
helix dipole may be subtracted from a set of allowed residues. This modification of amino acid 

20 groups is done on a residue by residue basis 

In a preferred embodiment, proline, cysteine and glycine are not included in the itst of possible 
amino acid side chains, and thus the rotamers for these side chains are not used. However, in a 
preferred embodiment, when the variable residue position has a ct> angle (that is, the dihedral angle 
defined by 1) the carbonyl carbon of the preceding amino acid; 2) the nitrogen atom of the current 
25 residue; 3) the a -carbon of the current residue; and 4) the carbonyl carbon of the current residue) 
greater than 0*, the position is set to glycine to minimize backbone strain. 

Once the group of potential rotamers is assigned for each variable residue position, processing 
proceeds to step 54 of Figure 2. This processing step entails analyzing interactions of the 
rotamers with each other and with the protein backbone to generate optimized protein sequences. 
30 The ranking module 34 may be used to perform these operations. That is, computer code is written 
to implement the following functions. Simplistically, as is generally outlined above, the processing 
initially comprises the use of a number of scoring functions, described below, to calculate energies 
of interactions of the rotamers, either to the backbone itself or other rotamers. 
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The scoring functions include a Van der Waals potential scoring function, a hydrogen bond 
potential scoring function, an atomic solvation scoring function, a secondary structure propensity 
scoring function and an electrostatic scoring function. As is further described below, at least one 
scoring function is used to score each position, although the scoring functions may differ depending 
5 on the position classification or other considerations, like favorable interaction with an a-helix 
dipole. As outlined below, the total energy which is used in the calculations is the sum of the 
energy of each scoring function used at a particular position, as is generally shown in Equation 1: 

Equation 1 

E ( uiai = nE vdw + nE as + nEh^^g + nE S5 + nE etoc 

10 In Equation 1 , the total energy is the sum of the energy of the van der Waals potential (E^), the 
energy of atomic solvation (E„), the energy of hydrogen bonding (E^^g), the energy of 
secondary structure (E M ) and the energy of electrostatic interaction (E e(ec ). The term n is either 0 or 
1. depending on whether the term is to be considered for the particular residue position, as is more 
fully outlined below. 

15 In a preferred embodiment, a van der Wsa's' scoring function s used. As is known in the art, van 
der Waals' forces are the weak, non-oovslent and non-ionic forces between atoms and molecules, 
that rs t the induced dipole and electron repulsion (Pauli principle) forces. 

The van der Waals scoring function is based on a van der Waals potential energy. There are a 
number of van der Waals potential energy calculations, including a Lennard-Jones 12/6 potential 
20 with radii and wel! depth parameters from the Dreiding force field, Mayo et a' , J Prot. Chem. . 
1390, expressly incorporated herein by reference, or the exponential 6 potential. Equation 2, 
shown below, is the preferred Lennard-Jones potential 

Equation 2 




R 0 is the geometric mean of the van der Waals radii of the two atoms under consideration, and D 0 
25 is the geometric mean of the well depth of the two atoms under consideration. E vdw and R are the 
energy and interatorr ic distance between the two atoms under consideration, as is more fully 
described below. 

In a preferred embodiment, the van der Waals forces are scaled using a scaling factor, a. 
Equation 3 shows the use of a in the van der Waals Lennard-Jones potential equation: 



18 



BNSDOCID: <WO 0116862A2 I > 



WO Cl/16862 



Equation 3 



PCT/USOO/40805 




The role of the a scaling factor is to change the importance of packing effects in the optimization 
and design of any particular protein. Specifically, a reduced van der Waals steric constraint can 
compensate for the restrictive effect of a fixed backbone and discrete sid ?-ohain rotamers in the 
5 simulation and can allow a broader samp'ing o' sequences compatible with a desired fold In a 
preferred embodiment a values ra:iuir:g from about 0.70 to about 1.10 can be used, with a values 
from about 0.8 to about 1.05 being preferred, and from about 0.S5 to about 1.0 being especially 
preferred. Specie a. values which are preferred are 0.S0 r 0.35, 0.90, 0.05, 1.C0, and 1.05. 



Generally speaking, variation of the van de' Waals scale factor a results in four regimes of packing 
10 specificity: regime 1 where 0.9 < a * 1 .05 and packing constraints dominate the sequence 

selection; regime 2 where 0.8 s a < 0.9 and the hydrophobic solvation potential begins to compete 
with packing forces; regime 3 where a < 0.8 and hydrophobic solvation dominates the design; and, 
regime 4 where a > 1 .05 and van der Waals repulsions appear to be too severe to aibw 
meaningful sequence selection. !n particular, different a values may be used for core, surface and 
15 boundary positions, with regimes 1 and 2 being preferred for core residues, regime 1 being 
preferred for surface residues, and regime 1 and 2 being preferred for boundary residues. 

In a preferred embodiment, the van der Waals scaling factor is used in the total energy calculations 
for each variable residue position, including core, surface and boundary positions. 

In a preferred embodiment, an atomic solvation potential scoring functior. is used. As is 
20 appreciated by those in the art, solvent interactions of a protein are a significant factor in protein 
stability, and residue/protein hydrophobic^ has been shown to be the major driving force in protein 
folding. Thus, there is an entropic cost to solvating hydrophobic surfaces, in addition to the 
potential for misfolding or aggregation. Accordingly, the burial of hydrophobic surfaces within a 
protein structure is beneficial to both folding and stability. Similarly, there can be a disadvantage 
25 for burying hydrophilic residues. The accessible surface area of a protein atom is generally defined 
as the area of the surface over which a water molecule can be placed white making van der Waals 
contact with this atom and not penetrating any other protein atom. Thus, in a preferred 
embodiment, the solvation potential is generally scored by taking the total possible exposed 
surface area of the moiety or two independent moieties (eitht<- a rotamer or tne first rotamer and 
30 the second rotamer), which is the reference, and subtracting out the "buiieri" area, i.e. the area 
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which is not solvent exposed due to interactions either with the backbone or with other rotamers. 
This thus gives the exposed surface area. 

Alternatively, a preferred embodtment calculates the scoring function on the basis of the "buried" 
portion; i.e. the total possible exposed surface area is calculated, and then the calculated surface 
5 area after the interaction of the moieties is subtracted, leaving the buried surface area. A 
particularly preferred method does both of these calculations. 

As is more fully described below, both of these methods can be done in a variety of ways. See 
Eisenberg et a/., Nature 319:199-203 (1986); Connolly, Science 221:709-713 (1983); and Wodak, 
et a/., Proc. Natl. Acad. Sci. USA 77(4): 1736-1 740 (1980), all of which are expressly incorporated 
10 herein by reference. As will be appreciated by those in the art, this solvation potential scoring 
function is conformation dependent, rather than conformation independent. 

In a preferred embodiment, the pairwise solvation potential is implemented in two components, 
"singles" (roiamer/template) and "doubles" (rotamer/rotamer), as is more fully described below. For 
the rotamer/tempiate buried area, the reference state is defined as the rotamer in question at 

1 5 residue position i with the backbone atoms only of residues i-1 , i and i+1 1 although in some 
instances just i may be used. Thus, in a preferred embodiment, the solvation potential is not 
calculated for the interaction of each backbone atom with a particular rotamer, although more may 
be done as required. The area of the side chain is calculated with the backbone atoms excluding 
solvent but not counted in the area. The folded state is defined as the area of the rotamer in 

20 question at residue i, but now in the context of the entire template structure including non-optimized 
side chains, i.e. every other fixed position residue. The rotamer/tempiate buried area is the 
difference between the reference and the folded states. The rotamer/rotamer reference area can 
be done in two ways; one by using simply the sum of the areas of the isolated rotamers; the 
second includes the full backbone. The folded state is the area of the two rotamers placed in their 

25 relative positions on the protein scaffold but with no template atoms present. In a preferred 

embodiment, the Richards definition of solvent accessible surface area (Lee and Richards, J. Mol. 
Biol. 55:379-400, 1971, hereby incorporated by reference) is used, with a probe radius ranging 
from 0.8 to 1.6 A, with 1.4 A being preferred, and Drieding van der Waals radii, scaled from 0.8 to 
1.0. Carbon and sulfur, and all attached hydrogens, are considered nonpoiar. Nitrogen and 

30 oxygen, and all attached hydrogens, are considered polar. Surface areas are calculated with the 
Connolly algorithm using a dot density of 10 A-2 (Connolly, (1983) (supra), hereby incorporated by 
reference). 

In a preferred embodiment, there is a correction for a possible overestimation of buried surface 
area which may exist in the calculation of the energy of interaction between two rotamers (but not 

20 
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the interaction of a rotamer wi f h the backbone) Since, as is genera 1 !*/ outlined below, rotamers are 
only considered in pa» r s. tha* is, a tint rotamer is only compared to a second rof.?Ttsr during the 
"doubles" calculations, this may overestimate the amount cf buried surface area in locat'ons where 
more than two rotamers interact, that is, where rotamers from three or more residue positions 
5 come together. Thus, a correction or scaling factor is used as outlined below. 

The genera! energy of solvation is shown in Equation 4; 

Equation 4 
E M - f(SA) 

where E sa is the energy cf solvation, f is a constant used to correlate surface area and energy, and 
10 SA is the surface area. This equation can be broken down, depending on which parameter is 
being evaluated. Thus, when the hydrophobic buried surface area is used, Equation 5 is 
appropriate: 

Equation 5 

^SJ " fl(SAt>ur:8<l hydrophobic) 

15 where f, is a constant which ranges from about 10 to about 50 cal/mol/A 2 , with 23 or 26 cal/mol/A 2 
being preferred. When a penalty for hydrophilic burial is being considered, the equation is shown 
in Equation 6: 

Equation 6 

20 where f- is a constant which ranges from -50 to -250 ca!/mol/A 2 ( with -86 or -100 cal/mol/A 2 being 
preferred. Similarly, if a penalty for hydrophobic exposure is used, equation 7 or 8 may be used: 

Equation 7 

~~ ' l(S-^t>urted hyd op^obic/' "** ^(^Ag^^g hydrophobic) 

Equation 8 

^ ^» ~ ^(^^burwd hydrophobic) + ^(^^burwd hydrophilic) + ^(^^xposeH hydrophot»c) + ^(SA exposed hydroph(|lC ) 

In a preferred embodiment, f 3 = -f,. 

In one embodiment, backbone atoms are not included in the calculation of surface areas, and 
values of 23 cal/mol/A 2 (f,) and -86 cal/mol/A 2 (f 2 ) are determined. 

In a preferred embodiment, this overcounting problem is addressed using a scaling factor that 
30 compensates for only the portion of the expression for pairwise area that is subject to over- 
counting. In this embodiment, values of -26 cal/mol/A 2 (f t ) and 100 cal/mol/A 2 (f 2 ) are determined. 
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Atomic solvation energy is expensive, in terms of computational time and resources. Accordingly, 
in a preferred embodiment, the solvation energy is calculated for core and/or boundary residues, 
but not surface residues, with both a calculation for core and boundary residues being preferred, 
although any combination of the three is possible. 

5 In a preferred embodiment, a hydrogen bond potential scoring function is used. A hydrogen bond 
potential is used as predicted hydrogen bonas do contribute to designed protein stability (see 
Stickle era/., J. Mol. Biol. 226:1143 (1992); Huyghues-Despointes era/., Biochem. 34:13267 
(1995), both of which are expressly incorporated herein by reference). As outlined previously, 
explicit hydrogens are generated on the protein backbone structure. 

10 in a preferred embodiment, the hydrogen bond potential consists of a distance-dependent term and 
an angle-dependent term, as shown in Equation 9: 

Equation 9 



where R 0 (2.8 A) and D 0 (8 kcal/mol) are the hydrogen-bond equilibrium distance and well-depth, 
respectively, and R is the donor to acceptor distance. This hydrogen bond potential is based on 

15 the potential used in DREIDING w:th more restrictive angie-dependent terms to limit the occurrence 
of unfavorable hydrogen bond geometries. The angle term varies depending on the hybridization 
state of the donor and acceptor, as Si K own in Equations 10, 11, 12 and 13. Equation 10 is used for 
sp 3 donor to sp 3 acceptor; Equation 1 1 is used for sp 3 donor to sp 2 acceptor, Equation 12 is used 
for sp 2 donor to sp 3 acceptor, and Equation 13 is usea for sp 2 donor to sp 2 acceptor: 

20 Equation 10 




F =cos 



; 2 6cos 2 ((J)-109.5) 



Equation 1 1 



F = cos 



; 2 6cos 2 (}> 



Equation 12 



F=cos 4 0 
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Equation 13 
F = cos z 0 cos' (max [cj>, (p]) 

In Equations 10-13 ; 9 is the dor.or-hydrogen-acceptor angle, <}) is the ' v ydrcgen-acceptor-base 
angle (the base is the atom attached to the acceptor, for example the carbonyl carbon is the base 
for a carbony! o>yger accepto*), and (p =s che angle between the normals of the planes defined by 
5 the six atoms attached to the sp ? centers (the supplement of cp is used when tp ?s less than 90"). 
The hydrogen-bond function is only evaluated when 2,6 A «; R c 3.2 A, 0 > 90\ $ - 109.5" < 90* for 
the ?p 3 donor - so 3 accepter case. ind. 0 > GC* for the so 3 done - sp 2 acceptor case; preferably, no 
switching functions are used. Temp:?.te donors rsr>d acceptors that are involved in 
template -template Vydrogen bonds are preferably not included in the doner and acceptor lists. For 
10 the purpose of exclusion, a template-template hydrogen bonri is considered to exist when 2.5 A s 
R <; 3.3 A and 6 > 135". 

The hydrogen-bond potential may also be combined or used with a weak coulombic term that 
includes a distance-dependent dielectric constant of 4CR, where R is the interatomic distance. 
Partial atomic charges are preferably only applied to polar functions! groups. A net formal charge 
15 of +1 is used for Arg and Lys and a net formal charge of -1 is used for Asp and G!u; see Gasteiger, 
et a/., Tetrahedron 36:3219-3288 (1980); Rappe, et a/. ( J. Phvs. Chem. 95:3358-3363 (1991). 

!n a preferred embodiment, an explicit penalty is given for buried polar hydrogen atoms which are 
not hydrogen bonded to another atom. See Eisenberg, et a/., (1986) (supra), hereby expressly 
incorporated by reference. In a preferred embodiment, this penalty for polar hydrogen burial, is 
20 from about 0 to about 3 kcal/mol, with from about 1 to about 3 being preferred and 2 kcal/rnol being 
particularly preferred. This penalty is only applied to buried polar hydrogens not involved in 
hydrogen bonds. A hydrogen bond is considered to exist when E HB ranges from about 1 to about 4 
kcai/mol, with E HB of less than -2 kcal/moi Leing preferred. In addition, in a preferred embodiment, 
the penalty is not applied to template hydrogens, i.e. unpaireJ buried hydrogens of the backbone. 

25 In a preferred embodiment, only hydrogen bonds between p. first -otamer and the backbone are 
scored, and rotamer-rotamer hydrogen bonds are not scored. In an alternative embodiment, 
hydrogen bends between a first roiamer and the backbone are scored, and rotamer-rotamer 
hydrogen bonds are scaled by 0.5 . 

In a preferred embodiment, the hydrogen bonding scoring function is used for all positions, 
30 including core, surface and boundary positions. In alternate embodiments, the hycrogen bonding 
scoring function may be used on only one jr two of these. 
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In a preferred embodiment, a secondary structure propensity scoring function is used. This is 
based on the specific amino acid side chain, and is conformation independent. That is, each amino 
acid has a certain propensity to take on a secondary structure, either a-helix or P-sheet, based on 
its cj) and angles. See Munoz et a/., Current Op. in Biotech. 6:382 (1995); Minor, et a/., Nature 
5 367:660-663 (1994); Padmanafchan, etaS. t Nature 344:268-270 (1990); Munoz, ef a/., Folding & 
Design 1(3):167-178 (1996); and Chakrabartty, eta!.. Protein Sci. 3:843 (1994), all of which are 
expressly incorporated herein by reference. Thus, for variable residue positions that are in 
recognizable secondary structure in the backbone, a secondary structure propensity scoring 
function is preferably used. That is, when a variable residue position is in an a-helical area of the 
10 backbone, the a-helical propensity scoring function described below is calculated. Whether or not 
a position is in a a-helical area of the backbone is determined as will be appreciated by those in the 
art, generally on the basis of <}> and iy angles; for a-helix, <t> angles from -2 to -70 and y angles from 
-30 to -100 generally describe an a-helical area of the backbone. 

Similarly, when a variable residue position is in a P-sheet backbone conformation, the 3-sheet 
15 propensity scoring function is used, p-sheet backbone conformation is generally described by 4> 
angles from -30 to -100 and x angles from +40 to +180. In alternate preferred embodiments, 
variable residue positions which are within areas of the backbone which are not assignable to 
either p-sheet or a-heiix structure may also be subjected to secondary structure propensity 
calculations. 

20 In a preferred embodiment, energies associated with secondary propensities are calculated using 
Equation 14 

Equation 14 

N (AG° -ag° , ) 
E^IO" " -1 

In Equation 14, E a (or E p ) is the energy of a-helical propensity, AG* aa is the standard free energy of 
helix propagation of the amino acid, and AG' ala is the standard free energy of helix propagation of 

25 alanine used as a standard, or standard free energy of p-sheet formation of the amino acid, both of 
w^ich are available in the literature (see Chakrabartty, et a/., (1994) (supra), and Munoz, et a/., 
(1S96) (supra)), both of which are expressly incorporated herein by reference), and N ss is the 
propensity scale factor which is set to range from 1 to 4, with 2.0 being preferred. This potential is 
preferably selected in order to scale the propensity energies to a similar range as the other terms in 

30 the scoring function. 

In a preferred embodiment, P-sheet propensities are preferably calculated only where the i-1 and 
i+1 residues are also in P-sheet conformation. 
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In a preferred emboaiment, the secondary structure propensity scoring function is u^ed only in the 
energy calculations fcr surface variable residue position.-, fr; alternate errbed^m*:^. ; u irt 
secondary s;fjc*ur; propensity scum -> function is used in tho calculations for core and boundary 
regions as w^.l. 

5 In a preferred embodiment, an electrostatic scoring function is used, as shown below in Equation 
15: 

Equation 15 

E.*c - qq' 
er 2 

10 In this Equation, q is the charge on atom 1, q' is charge on atom 2, and r is the interaction distance. 

In a preferred embodiment, at least one scoring function is used for each variable residue position; 
in preferred embodiments, two. three or four scoring functions are used for each variable residue 
position. 

Once the scoring functions to be used are identified for each variable position, the preferred first 
1 5 step in the computational analysis comprises the determination of the interaction of each possible 
rotamer with all or part of the remainder of the protein. That is, the energy of interaction, as 
measured by one or more of the scoring functions, of each possible rotamer at each variable 
residue position with either the backbone or other rotamers, is calculated. In a preferred 
embodiment, the interaction of each rotamer with the entire remainder of the protein, i.e. both the 
20 entire template and all other rotamers, is done. However, as outlined above, it is possible to only 
model a portion of a protein, for example a domain of a larger protein, and thus in some cases, not 
all of the protein need be considered. 

In a preferred embodiment, the first step of the computational processing is done by calculating 
two sets of interactions for each rotamer at every position (step 70 of figure 3): the interaction of 
25 the rotamer side chain with the template or backbone (the "singles" energy), and the interaction of 
the rotamer side chain with all other possible rotamers at every other position (the "doubles" 
energy), whether that position is varied or floated. It should be understood that the backbone in this 
case includes both the atoms of the protein structure backbone, as well as the atoms of any fixed 
residues, wherein the fixed residues are defined as a particular conformation of an amino acid. 

30 Thus, "singles" (rotamer/template) energies are calculated for the interaction of every possible 
rotamer at every variable residue position with the backbone, using some or aH of the scoring 
functions. Thus, for the hydrogen bonding scoring function, every hydrogen bonding atom of the 
rotamer and every hydrogen bonding atom of the backbone is evaluated, and the E HB is calculated 
for each possible rotamer at every variable position. Similarly, for the van der Waals scoring 
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, u nc,,on every atom o, the reamer is compare, to every a,om o, the <**-» exclud.ng 

the baeKbone atoms o, » own residue,, and the E- is calculated «>< — > J*™" ' 

ev.ry variable rescue ccsihon. In addition, general no van der Waais energy , calculated ,Hhe 
11 are connected b, three bonds or tess. For the atomic solvation sconng tuncaon. the surface 
5 o. me roomer is measured against the surface o, me template, and the E„ .or each poss,Ue 
rotamer a. every variabie residue position is caicuiated. The secondary structure propensity 
scoring .unction is also considered as a singles energy, and thus me totai singles energy may 
contai an E„«erm. As will be appreciated by mcsein me art. many o, these energy terms wl be 
Tose to zero depending on the physlcat distance between the rotamer and me template P os,t,on; 
10 mat is. the farther apart the two moieties, the closer to zero. 

According*, as outlined above, the total s,ng,es energy is the sum o, me energy o, each scoring 
tuncuon used at a partcular position, as shown in Equation 1 . where.n n is either , or zero, 
depending on whether that parfcular scoring function was used a, the totamer posrtion: 

Equation 1 

1 5 E Mai = nE vOW + nE„ + nE,.^ + nE 5S + nE.^ 

Once ca,cu,ated, each s.ng.es E tota , for each possib.e rotamer ,s stored in the memory 24 within the 
computer, such that it may ne used ,n subsequent calculations, as outlined below. 

for the calcu.at.on of "doub.es" energy (rotamer/rotamer). the .nteraction energy of each possible 
rotamer is compareo with eve^ possib.e rotamer at all other variable residue positions. Thus^ 
20 'doubles" energies are ca.cu.ated for the interact.cn of every P ossib.e rotamer at every vanab.e 

residue position with ever, possib.e rotamer at every other variab.e residue position, us.ng some or 
a., of the scoring functions. Thus, for the hydrogen bond.ng scoring function, every hydrogen 
bonding atom of the first rotamer and every hydrogen bonding atom of every possib.e second 
rotamer is eva.uated, and the E HB is ca.cuiated for each possib.e rotamer pair for any two vanab.e 
25 positions. Simi.ar.y, for the van der Waa.s scoring function, every atom of the first rotamer ,s 
compared to every atom of every possible second rotamer. and the E. dW is ca.cu.ated for each 
possible rotamer pair at every two variabie residue positions. For the atomic solvation sconng 
function, the surface of the frst rotamer is measured against the surface of every possib.e second 
rotamer, and the E as for each possible rotamer pair at every two variab.e residue positions ,s 
30 ca.cu.ated. The secondary structure propensity scoring function need not be run as a "doubles 

energy as it is considered as a component of the "sing.es" energy. As wi.. be appreciated by those 
in the art, many of these doub.e energy terms wi.. be dose to zero, depending on the phys.ca. 
distance between the first rotamer and the second rotamer; that is. the farther apart the two 
moieties, the closer to zero. 
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Accordingly, as outlined 3bove, the total doubles energy is the sum o* th^ entity :rf each scoring 
function used to evaluate ever)- possible pair of rc*.s:ns:?.. as shown x- Equation *»6, w!" srein n is 
either 1 or zero, depending on whether that particular scoring function was used at the rctornor 
position: 

Equation 16 



^-to;a! *~ N^vcSw * ""'^as f. -bonding '* C« 



An example is illuminating A first variable position, i has three (an unrs&iistically low number) 
possible rotarners (which may be either from a single amino acid or different amino acids) which 
are labelled L, i bf and L A second variable position, j : also has three possible rotarners, labelled j d 
j e , and j, Thus, nine doubles energies (E^) are calculated In ali: E tota ,(i a . jj, E l0X J\ at ,J, E total (! a , j f ), 
E to tai(ib. j d ). E total (i bt j e ), E total (i bf j f ), E, otal (i c , j d ), E lolal (i C( j e ), and E total (i c , j f ). 

Once calculated, each doubles E touj for each possible retainer pair is stored in memory 24 within 
the computer, such that it may be used in subsequent calculations, as outlined below. 



Once the singles and doubles energies are calculated and stored, the next step of the 
15 computational processing may occur. Generally SDeaking, the goal of the computational 

processing is to determine a set of optimised protein sequences. By "optimized protein sequence" 
herein is meant a sequence that best fits the mathematical equations herein. As will be 
appreciated by those T: the art a global optimized seqij(-ncv is the one sequence that best fits 
Equation 1 t i.e. the sequence that has the lowest energy cf any possible sequence. However, 
20 there are any number of sequences th.3t are not the global minimum but that have low energies. 

In a preferred embodiment, th? set comprises the gbbaily optima! sequence in its optimal 
conformation, i.e. the optimum rotamer at each variable position. That is, computational processing 
is run until the simulation program converges on a single sequence which is the global optimum. 

In a preferred embodiment, the set comprises at least two optimized protein sequences. Thus for 
25 example, the computational processing step may eliminate a number of disfavored combinations 
but be stopped prior to convergence, providing a set of sequences of which the global optimum is 
one. In addition, further computational analysis, for example using a different method, may be run 
on the set, to further eliminate sequences or rank them differently. Alternatively, as is more fully 
described below, the global optimum may be reached, and then further computational processing 
30 may occur, which generates additional optimized sequences in the neighborhood of the global 
optimum. 
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If a set comprising more than one optimized protein sequences is generated, they may be rank 
ordered in terms of theoretical quantitative stability, as is more fully described below. 

In a preferred embodiment, the computational processing step first comprises an elimination step, 
sometimes referred to as "applying a cutoff, either a singles elimination or a doubles elimination. 

5 Singles elimination comprises the elimination of all rotamers with template interaction energies of 
greater than about 10 kcal/mol prior to any computation, with elimination energies of greater than 
about 15 kcal/mol being preferred and greater than about 25 kcal/mol being especially preferred. 
Similarly, doubles elimination is done when a rotamer has interaction energies greater than about 
1 0 kcal/mol with all rotamers at a second residue position, with energies greater than about 1 5 

1 0 being preferred and greater than about 25 kcal/mol being especially preferred. 

In a preferred embodiment, the computational processing comprises direct determination of total 
sequence energies, followed by comparison of the total sequence energies to ascertain the global 
optimum and rank order the other possible sequences, if desired. The energy of a total sequence 

is shown below in Equation 17: 
.j 5 Equation 17 

^totalprotein ^(b-b) ^ ^0,) ^ a u^a irs 

Thus every possible combination of rotamers may be directly evaluated by adding the backbone- 
backbone (sometimes referred to herein as template-template) energy (E,^, which is constant over 
all sequences herein since the backbone is kept constant), the singles energy for each rotamer 
(which has already been calculated and stored), and the doubles energy for each rotamer pair 
20 (which has already been calculated and stored). Each total sequence energy of each possible 

rotamer sequence can then be ranked, either from best to worst or worst to best. This is obviously 
computationally expensive. and becomes unwieldy as the length of the protein increases. 

Thus, as outlined herein, the computational processing includes one or more Branched & 
Terminated (B&T) computational steps as outlined below, and optionally a DEE step, also outlined 
25 below. 

Accordingly, the present invention provides a novel deterministic combinatorial search algorithm, 
called "Branch and Terminate" (B&T) derived from the Branch -and-Bound search method. The 
B&T approach is based on the construction of an efficient, but very restrictive bounding expression, 
which is used for the search of a combinatorial tree representing the protein system. The bounding 
30 expression is used both to determine the optimal organization of the tree and to perform a highly 
effective pruning procedure named "termination." For some calculations, the B&T method rivals 
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the current deterministic standard, Dead-End Elimination (DEE), sometimes finding the solution up 
to 21 times faster. A more significant feature of B&T algorithm is that ii can provide an efficient way 
to complete the optimization of problems that have been partially reduced by a DEE algorithm. 

The B&T algorithm is an effective optimization algorithm when used alone. Moreover, it can 
5 increase the problem size iimit of ammo acid side chain placement calculations, such as protein 
design, by completing DEE optimizations that reach a point at which the DEE criteria become 
inefficient Together the two algorithms make it possible to find solutions to problems that are 
intractable by either aigoritnm alone. 

In a preferred embodiment, B&T is used wnen DEE algorithms are not sufficient, due either to the 
1 0 nature of their energy distributions or their sheer size. For example, the optimization of long 
hydrophilic side chains on p-sheets is typically composed of large numbers of rotamers with 
interaction energies that are very smali in magnitude. DEE is able to reduce the combinatorial size 
of the problem significantly at the outset, but soon after, elimination becomes inefficient, relying 
entirely on computationally expensive DEE doubles calculations ( Lasters, I. & Desmet, J., Prot. 
15 Eng. 6, 717-722 (1993); Gordon, D.B. & Mayo, S.L, J. Comp. Chem. 19, 1505-1514 (1998)). This 
behavior is also observed in the later stages of very large calculations, when after several rounds 
of unification further eliminations become difficult and the number of super-rotamers at super- 
residue positions becomes very large (Desmet, J., et al., in The Protein Folding Proolem and 
Tertiary Structure Prediction. Merz Jr., K. & Le Grand, S. Ed., Birkhauser, Boston, p. 307 (1994)). 
20 To complete such calculations, a technique consisting of exhaustive combinatorial build-up aided 
by DEE has been described (De Maeyer, M., et aL, Folding & Design, 2, 53-56 (1997)). However, 
since the effectiveness of the elimination criteria is poor in these cases, it is advantageous to 
construct a method that is not dependent on them. 

To address these difficult optimization problems, the invention provides u Branch-and-Terminate" 
25 (B&T) methods, based on a "Branched & Bound" (B&B) algorithm. B&B algorithms comprise a 

sub-class of backtrack algorithms that utilize information about costs (or energies) of complete and 
partial solutions. Backtrack algorithms are commonly used in atomic-level simulations to construct 
self-avoiding chains, and they have been used in protein design to engineer metal binding sites into 
proteins (Hellinga, H.W. & Richards, F.M., J. Mol. Biol. 222, 763-785 (1991)). 

30 B&B algorithms are commonly applied to theoretical combinatorial and scheduling problems, and 
more recently to combinatorial problems of structural biology ranging from sequence alignment 
(Lathrop, R.H. & Smith, T.F., J. Mol. Biol. 255, 641-665 (1996)), and structural comparison 
(Escalier, V., et al., J Comp. Biol. 5, 41-56 (1998)), to macromolecular packing (Wang, C.S.E., et 
al. t Proteins, 32, 26-42 (1998)), ligand design (Todorov, N.P. & Dean, P.M., J. Comp. Aid. Mol. 
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Des 12 335-349 (1998)). and recently, protein tertiary structure prediction (Eyrich, V. A. et al„ 
Proteus 35 41-57(1999)). Toward the study of protein side-chains, Samudrala and Moult 
(Samudra.a R & Moult, J., J. Mol. Biol. 279, 287-302 (1998)) have described a graph-theorefc 
approach to the closely related problem of comparative mode.ing, m which they represent the 
search as a clique-finding problem, which they solve using a B&B algorithm. In addition, Leach 
and Lemon (Leach, A.R. & Lemon. A.P., Proteins. 33, 227-239 (1998)) have used a B&B a.gonthm 
(called "A*") to explore the conformational energy surface of protein side-chains. 

, t is straightforward to formulate the s,de chain optimization oroblem for direct optimization by a 
B&B algorithm. All that is necessary is to describe the problem as a search of a combinatory! tree 
where one searches for the single path through the branches that corresponds to the global 
minimum eneroy conformation (GMEC) set of rotamers. The B&B a.gorithm is effective because ,t 
si mu.taneous,y~orunes the tree whi.e searching; each branch is tested with a quantitative bound.ng 
expression before being searched. The addition of "termination" functions as descnbed here.n 
increases the optimization speed dramatically. The description of the B&T algorithm that follows is 
tailored for rotamer selection, but the algorithm is in fact genera.izable to any comb.natonal 
optimziation problem in which all the interaction energ.es are pairwise and pre-computable. The 
bounding expression is similarly general. 

First, a bounding function is used that maximizes the efficiency of pruning for problems in which to 
total energy can be decomposed into interactions between pairs of rotamers. When a 

20 combinatorial tree is used to describe the side chain optimization problem, the root of the tree .s 
p,aced at the top. and branches extend downward. Each level of depth of the tree corresponds to 
an amino acid position, and each node represents a particular rotamer cho-ce at that position. 
Thus a path that extends all the way from the tree root through all levels of branches to a leaf 
describes a complete rotamer sequence. The problem, then, is to search for the path 

25 corresponding to the sequence with the lowest energy. 

A partial path from the root describes a rotamer sequence that is incompletely specified. 
Alternatively, the path can be interpreted physically as specifying a unique composite rotamer, or 
••super-rotamer" that occupies a subset of the amino acid positions. Extending the path deeper ,nto 
the tree corresponds to appending additional rotamers to the super-rotamer. which can be 
30 repeated until all positions are specified. According to this interpretation, a full search of the tree 
would entail the construction of all possible super-rotamers to completion. 

It is often possible, however, to determine that a particular partially-specified super-rotamer ,s not 
part of the GMEC. In such a case, it is unnecessary to explore any combinations that would result 
from building up the super-rotamer further. Applied recursively, such observations prune sub-trees 
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from nodes throughout -he tree, thereby enabling an exhaustive search without complete 
enumeration ef possible super -rot^mers. 

The pruning dCreivniriisUon is a<j;;omplisied by oompsing a lovver energy bound for the partially- 
speoif;ed rc-ta-Ti^r sequence i:* a known reference energy As shown in Equation 13, given a 
5 reference energy of any plausible sequence, it must be true thst the ene*gy of the GMEC is less 
than or equal to the energy of any plausible sequence. 

Equation 18 

E < E 

^CMEC — ^reference 

One may therefore deduce that the global minimum does not contain a particular super-rotamer 
10 upon observing as in Equation 19 tnat the energy E superb€St of the sequence resulting from optimal 
completion of the candidate super-rotamer is greater than the reference energy. 

Equation 19 

r ^ r 

^ super, best ** "reference 

Finding the optimal completing sequence, however, can be as difficult as the original problem, so 
15 we instead construct an expression for a lower energy bound, E superbound . The expression is 
constructed to compute an inexpensive lower energy bound based on the partially specified 
sequence, as well as on the rotamers that are available at the unspecified positions. By definition, 
the bound must satisfy the inequality, 

Equation 20 
E > E 

2Q sun* r fct'jf zuptr. houn J 

With this quantity in hand, we may prune any sub-tree for which we observe that the lower bound is 
greater than the reference energy. 

Equation 21 

^ *>iper bound ^ ^reference 

25 This is the bounding criterion. The Branch-and-Bound algorithm consists of an exhaustive traversal 
of the combinatorial tree, applying this criterion to each node as it is encountered. 



Whenever the search produces a complete path with an energy lower than the current reference 
energy, the reference energy is updated. This way, the effectiveness of the bounding criterion is 
increased over the course of the optimization. Moreover, upon completion of the search, the 
30 reference energy is the global minimum energy. The corresponding sequence is also stored during 
each update, which produces the corresponding GMEC. 
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The successful implementation of a B&B type of algorithm depends largely on the construction of 
the bounding expression. A bounding expression that is very stringent will produce lower bounds 
that are high in energy, and therefore will result in more sub-trees that can be pruned by the 
bounding criterion. The size of the resulting tree will be smaller than one pruned by a less stringent 
5 expression, and the search will be faster. It is therefore important to design the bounding 
expression to most fully utilize the sequence information available. 

On the other hand, stringency is obtained at the cost of time. A maximally stringent bound might 
prune all sub-trees except for the one containing the global minimum, but it would take an 
impractical amount of time to compute. It is therefore also necessary to temper stringency with 
10 speed considerations in order to obtain a bounding expression that is properly balanced for efficient 
searching. 

The construction of such a bounding expression is shown in Example 1. Given a partially 
constructed super-rotamer and the available rotamers at the remaining positions, the approach is 
to utilize the corresponding energetic information as fully as possible while keeping the 
1 5 computational order of the bounding expression constant. The result is a novel, highly-effective 
bounding expression that provides the basis for the remaining B&T techniques. 

An additional advantage of the B&T allogrithm is the form of the resulting expression; it isolates 
those parts of the expression that are identical for rotamers on the same level of a sub-tree. Thus 
it is possible to further increase the efficiency of the search by precomputing these shared 
20 quantities as each group of nodes is encountered, rather than redundantly evaluating the entire 
bounding expression for every unique node. This method is described in Example 1. 

In addition to the bounding function, the invention further provides a termination function, in which 
the bounding function is used to deterministicaily remove rotamers at all amino acid positions, 
thereby reducing the overall size of the tree before searching. Termination is additionally effective 
25 when perfor. ed at every level of recursion of the search, sometimes increasing the overall speed 
of the optimization by an order of magnitude. 

The enhancements of the B&T algorithm relative to the B&B method are based on a process called 
"termination." Because all the pairwise interactions are precomputable, the organization of the 
combinatorial tree is arbitrary (i.e. there is no specific order to which different amino acid positions 
30 must be assigned to different levels of the tree). However, organization of the tree can have a 

significant influence on the speed of the calculation. For example, a greater reduction in the size of 
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tne search is derived from pruning a branch at the root of the tree ratner ;han pruning u branch 
closer to the leaves. Placing a branch at the leaves thai *vculd be pruned ;f p;acec at the root 
would be inefficient Decause tne same pruning step would necessarily bo' repeated tor every leaf. 

In fact, it commonly occurs that all amino acid positions have some rotamers that could be pruned 
5 if placed at the root of the tree. To circumvent the potential loss of efficiency, we implement a pre- 
processing procedure Defore determining the tree organization. This procedure consists of 
temporarily considering each amino acid position to be at the root level and checking it any of its 
rotamers can be immediately pruned. All roiamers pruned from root positions may be completely 
discarded for the remainder of the optimization, and are dubbed "terminated" to reflect this fact. 
10 The result is an cvera!! reduction of the tree size prior to searching, making the optimization faster. 

The selection of the word terminate ' is intended to be contrasted with "eliminate," which is used to 
describe rotamers that are analogously discarded by using the DEE criterion. Indeed, many of the 
same rotamers are discarded. As with DEE, termination may be performed iteratively until no 
further rotamers are terminated Iterative termination is executed as the preprocessing step before 
1 5 search of the tree. 

Although termination serves as an effective preprocessing step, the hallmark of the B&T algorithm 
is that termination is employed at every level of recursion. At any point of the search, the rotamers 
defined at levels above the level of the current smino-acid position may be considered a root 
comprised of a single, partially specified super-rotamer. Termination, then, consists of temporarily 
20 considering each of the rotamers at all the remaining positions as candidates for the next 

appendage of the super-rotamer and applying the bounding criterion to each one. All rotamers 
terminated this way may be discarded from the optimization of tne sub-tree with this partially 
specified super-rotamer root. 

In contrast, the recursive step in a B&B search consists of application of the bounding criterion to 
25 the rotamers at only one amino acid position. The benefits of the extra reductions in the sizes of 
sub-trees far outweigh the costs of calculation of extra bounds for termination. The resulting 
increase in efficiency makes the B&T search significantly faster than a similarly constructed B&B 
search. 

In a preferred embodiment, it is not necessary to perform iterative termination at every level of 
30 recursion, unlike termination preprocessing. A single iteration per branch generally yields the best 
performance. 
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In addition the energetic information produced by the termination process can be used to 
determine the optimal search order for the remainder of the tree. Because termination effectively 
replaces the usual bounding process, the resulting breadth-first algorithm is called "Branch-and- 
Terminate." We also describe a variation of the B&T method that can rapidly find approximate 
5 solutions close to the GMEC. 

When traversing the combinatorial tree, it is necessary to determine (1) the order in which to 
explore rotamers at each position, and (2) the sequence in which to explore the different positions. 
For both cases, we utilize the bounding energies calculated for each rotamer during termination. 

We have observed an empirical correlation between low bounding energy and membership in the 
10 GMEC. Therefore, the rotamers at each position are searched in order of increasing bounding 

energy. Conducting the search in this way increases the chance that solutions close to the GMEC 
are found quickly, thereby providing stringent reference energies early in the calculation. 

With respect to the or dering of the different positions, we construct a heuristic based cn both the 
termination bounding energies and the size of the rotamer lists. In a conventional tree search, the 

15 positions should be organized in order of increasing number of rotamers per position in order to 
minimize the total number of nodes in the tree. However, in a B&T search, there are other 
organization schemes that favor high-level pruning by termination, which reduce the tree size more 
significantly. We use the bounding energy of the top-ranked (lowest bounding energy) rotamer at 
each position to indicate which positions are likely to restrict the rest of the system, and 

20 consequently favor high-level termination if placed at the super-rotamer root. Because the 

minimum operators at a node are applied over a set including the subset corresponding to the sub- 
tree nodes, bounding energies of sub-tree nodes must be higher than or equal to their parent 
nodes. Therefore, placing positions with high lowest-energies at the top of the tree promotes high 
bounding energies for their descendents. Since the rotamer lists of a sub-tree can be significantly 

25 different than those of its parent, residue ordering is performed at every level of recursion depth. 

In a preferred embodiment, an optimal ordering may be obtained by combining energetic and list- 
size sorting criteria using the following heuristic. Positions are sorted in descending order 
according to a rank index, as computed in Equation 22, 



Equation 22 



30 



Rank Index = (1 - /) 



1 + 
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where N is the number of rotamers at the position, E fv is the bounding energy of the top-ranked 
rotamer of that position, and E^ flXti and E toptneK are the minimum and maximum top-ranked 

h\y\ t inrltnn Anomioc r*if rail rN^>r»ifi/^rtf rocna^li\/ol\i Tho ovnrnerinn A 1/ A _i_lr^^/\ i r* ^«Anr«*>-> • * A 
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produce an attenuated weighting inversely proDortional to the number of rotamers that evaluates to 
5 unity when /V=1 The quantity f \s selected to control the relative weighting of the two criteria. A 
value of zero for f corresponds to so«i based entirely on the number of residues per position, and a 
value of one produces a ranking based entirely on bounding energies. 

A solution that is very close to the GMEC sequence can be found very rapidly by using an 
approximate variation of the R&T method Approximate calculations are particularly useful for 
10 providing a fast way to obtain low reference energies for exact B&T optimizations. Moreover, the 
approximate calculation is often sufficient to produce the GMEC energy. 

The approximation is based on the observation that the GMEC rotamers are often amcng those 
with the lowest termination bounding energies according to the bounding expression. This indicates 
that the bounding expression has predictive properties. To rapidly find an approximate solution, 
15 the ranked rotamer lists are arbitrarily truncated after the pre-processing termination step, and the 
B&T search is conducted on the abbreviated set of rotamers. 

A more reliable solution can be found by repeating the approximate optimization with more lenient 
truncation, using the solution from the preceding run for the initial reference energy. 

The description of the Branch-and-Terminate algorithm herein is tailored for rotamer selection, but 
20 the algorithm is in fact generalizable to any combinatorial optimization problem in which all the 
interactions energies are pairwise and pre-computabie. The bounding expression we describe is 
similarly general. 

Although the B&T algorithm can be used by itself, greater benefit can often be obtained by using it 
in concert with a DEE algorithm. Together, the algorithms can solve optimization problems much 
25 more quickly than either can accomplish alone. This may make it possible to quickly find the 

GMEC for protein design problems that were previously insoluble by either algorithm. Perhaps the 
most practical use of the B&T algorithm is to complement DEE when dealing with optimization 
problems that are too difficult to solve using either algorithm alone. 

In a preferred embodiment, the B&T and DEE algorithms are used in succession. DEE is used to 
30 eliminate rotamers and to perform unification until the optimization reaches iterations that are 
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inefficient. Inefficiency typically occurs after several unifications when the total number of rotamers 
and unified super-rotamers gets very large <>5000) and very few eliminations result even from 
lengthy Goldstein doubles calculations. At this stage, the DEE optimization is aborted, and the 
state information is transferred to a B&T implementation. Rotamer lists and energy tables are 
5 transferred directly, including references to unified super-rotamers, which are transparently 
represented as ordinary rotamers in the B&T algorithm. 

An additional performance improvement is obtained by also passing the list of dead-ending pairs 
(DEP). DEP's are pairs of rotamers (or super-rotamers) whose members cannot simultaneously 
exist in the GMEC. These pairs may therefore be safely omitted from the minimum operators in 
10 Equation 39. 

In a preferred embodiment, the computational processing includes one or more Dead-End 
Elimination (DEE) computational steps. The DEE theorem is the basis for a very fast discrete 
search program that was designed to pack protein side chains on a fixed backbone with a known 
sequence. See Desmet, er a/., Nature 356:539-542 (1992); Desmet, et a/., The Proetin Folding 
Problem and Tertiary Structure Prediction , Ch. 10:1-49 (1994); Goldstein, Biophvs. Jour. 66:1335- 
1340 (1994), all of which are incorporated herein by reference. DEE is based on the observation 
that if a rotamer can be eliminated from consideration at a particular position, i.e. make a 
determination that a particular rotamer is definitely not part of the global optimal conformation, the 
size of the search is reduced. This is done by comparing the worst interaction (i.e. energy or E total ) 
of a first rotamer at a single variable position with the best interaction of a second rotamer at the 
same variable position. If the worst interaction of the first rotamer is still better than the best 
interaction of the second rotamer, then the second rotamer cannot possibly be in the optimal 
conformation of the sequence. The original DEE theorem is shown in Equation 23: 

Equation 23 

E(i a ) + Ilmin over t{E(i a , j t )}] > E(i b ) + I[max over t{E(i b , j t )}] 
j j 

In Equation 23, rotamer i a is being compared to rotamer i b . The left side of the inequality is the best 
possible interaction energy (E tota) ) of i a with the rest of the protein; that is, "min over t" means find 
the rotamer t on position j that has the best interaction with rotamer i a . Similarly, the right side of 
30 the inequality is the worst possible (max) interaction energy of rotamer i b with the rest of the 

protein. If this inequality is true, then rotamer i a is Dead-Ending and can be Eliminated. The speed 
of DEE comes from the fact that the theorem only requires sums over the sequence length to test 
and eliminate rotamers. 



20 
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In a preferred embodiment a variation of DEE is performed. Goldstein DEE. based on Goldstein, 
(1994) (supra) ; hereby expressly incorporated by reference, is a variation of the DEE computation, 
as shown in Equation 24: 

Equation 24 

5 E(i a ) - E(i b ) + V[min over t{E(i a , J( ) - E(i b , j t )}] > 0 

In essence, the Goldstein Equation 24 says that a first rotamer a of a particular position 

i (rotamer i a ) will not contribute to a local energy minimum if the energy of conformation with i a can 
always be lowered by just changing the rotamer at that position to i b , keeping the other residues 
equal. If this inequality is true, then rotamer i a is Dead-Ending and can be Eliminated. 

10 Thus, in a preferred embodiment, a first DEE computation is done where rotamers at a single 
variable position are compared, ("singles" DEE) to eliminate rotamers at a single position. This 
analysis is repeated for every variable position, to eliminate as many single rotamers as possible. 
In addition, every time a rotamer is eliminated from consideration through DEE, the minimum and 
maximum calculations of Equation 23, depending on which DEE variation is used, thus conceivably 

15 allowing the elimination of further rotamers. Accordingly, the singles DEE computation can be 
repeated until no more rotamers can be eliminated; that is, when the inequality is not longer true 
such that all of them could conceivably be found on the global optimum. 

In a preferred embodiment, "doubles" DEE is additionally cone. In doubles DEE, pairs of rotamers 
are evaluated; that is, a first rotamer at a first position and a second rotamer at a second position 

20 are compared to a third rotamer at the first position and a fourth rotamer at the second position, 
either using original or Goldstein DEE. Pairs are then flagged as nonallowable, although single 
rotamers cannot be eliminated, only the pair. Again, as for singles DEE, every time a rotamer pair 
is flagged as nonallowable, the minimum calculations of Equation 24 change (depending on which 
DEE variation is used) thus conceivably allowing the flagging of further rotamer pairs. Accordingly, 

25 the doubles DEE computation can be repeated until no more rotamer pairs can be flagged. 

In addition, in a preferred embodiment, rotamer pairs are initially prescreened to eliminate rotamer 
pairs prior to DEE. This is done by doing relatively computationally inexpensive calculations to 
eliminate certain pairs up front. This may be done in several ways, as is outlined below. 
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To search exhaustively for all dead-ending retainers at a residue position i, it is necessary to 
compare every rotamer to every other rotamer available at i. In a comparison matrix, each column 
corresponds to a particular rotamer, i r , as a candidate r for elimiantion, and each row corresponds 
to one of the possible reference rotamers i t . If there are n t rotamers at position i, then an 
5 exhaustive search of n 2 -n matrix elements is necessary. Such a matrix is evaluated for each of the 
positions that may be represented by L 

In a preferred embodiment, the rotamer pair with the lowest interaction energy with the rest of the 
system is found. Inspection of the energy distributions in sample comparison matrices has 
revealed that an ij v pair that dead-end eliminates a particular i r j s pair can also eliminate other i r j s 
10 pairs. In fact, there are often a few ij v pairs, which we call "magic bullets," that eliminate a 

significant number of i r j s pairs. We have found that one of the most potent magic bullets is the pair 
for which maximum interaction energy, e max ([ij v ])k t , is least (see Equations 29-31). This pair is 
referred to as (i j v ) mb If this rotamer pair is used in the first round of doubles DEE, it tends to 
eliminate pairs faster. 

1 5 Our first speed enhancement is to evaluate the first-order doubles calculation for only the matrix 
elements in the row corresponding to the (ij v ) mb pair. The discovery of (i j w ) mb is an n 2 calculation (n 
= the number of rotamers per posiiion), and the application of Equation 24 to the single row of the 
matrix corresponding to this rotamer pair is another n 2 calculation, so the calculation time is small in 
comparison to a full Goldstein calculation. In practice, this calculation produces a large number of 

20 dead-ending pairs, often enough to proceed to the next iteration of singles elimination without any 
further searching of tha doubles matrix. 

The magic bullet Goldstein calculation will also discover all dead-ending pairs that would be 
discovered by the Equation 23 or 24, thereby making it unnecessary. This stems from the fact that 
em a x((a)mb) must be less than or equal to any e max ([ij v J) that would successfully eliminate a pair by 
25 Equations 23 or 24. 

Since the minima and maxima of any given pair has been precalculated as outlined herein, a 
second speed-enhancement precalculation may be done. By comparing extrema, pairs that will 
not dead end can be identified and thus skipped, reducing the time of the DEE calculation. Thus, 
pairs that satisfy either one of the following criteria are skipped: 

30 Equation 25 

e . ( [i j ] )<e ( [i j ] ) 

nun r s nun u v 
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Equation 26: 



max r s max u v 



Because the matrix containing these calculations is symmetrical, half of its elements will satisfy the 
first inequality Equation 25, and half of those remaining will satisfy the other inequality Equation 26. 
These three quarters of the matrix need not be subjected to the evaluation of Equation 23 or 24, 
resulting in a theoretical speed enhancement of a factor of four. 



The last DEE speed enhancement refines the search of the remaining quarter of the matrix. This is 
done by constructing a metric from the precomputed extrema to detect those matrix elements likely 
to result in a dead-ending pair. 



A metric was found through analysis of matrices from different sample optimizations. We searched 
for combinations of the extrema that predicted the likelihood that a matrix element would produce a 
dead-ending pair. Interval sizes (see Figure 4) for each pair were computed from differences of the 
extrema. The size of the overlap of the i j and i j intervals were also computed, as well as 

r s u v 

the difference between the minima and the difference between the maxima. Combinations of these 
quantities, as well as the lone extrema, were tested for their ability to predict the occurrence of 
dead-ending pairs. Because some of the maxima we r e very large, the quantities were also 
compared logarithmically. 



Most of the combinations were able to predict dead-ending matrix elements to varying degrees. 
The best metrics were the fractional inte.*rval overlap with respect to each pair, referred to herein as 
q f$ and q uv . 

Equation 27 

= interval overlap _ e MX < I Vv ] > ([ V, 1 } 



r* interval ( [ i j ] ) e ( [ i j ] ) -e ( [ i j ] ) 

r s max r s min r s 



Equation 28 

q ^ interval overlap _ € max < [ Vv ] > "^n ([ V* 1 > 



■uv interval ( [i j ] ) e ( f i j } ) ~e ( [i j ]) 

" " m " " " min IT - 
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These values are calculated using the minima and maxima equations 29, 30, 31 and 32 (see 
Figure 5): 



Equation 29 



e ( [ i j ] ) =e ( [ i j ] ) + S maxe ( [ i j ] ,k 

max r s r s ... r s t 



Equation 30 



e ( [ i ; ] ) =g ( [ i 7 I ) + S mine ([i j ],Jc) 

min r's r*s , .. r s t 



Eouat ; on 31 



e ( t i j ] ) =e ( [ i j ] ) + L maxe ( [ i j ] , k ) 

max u v u v u v t 



Equation 32 

e . ( [i j ] ) =e ( [ i j ] ) + 2 mine ( [ i j ] , k ) 

min u v u v u v t 



These metrics were selected because they yield ratios of the occurrence of dead-ending matrix 
elements to the total occurrence of elements that are higher than any of the other metrics tested 
For example, there are very few matrix elements (-2%) for which > 0.98, yet these elements 
10 produce 30-40% of all of the dead-ending pairs. 



Accordingly, the first-order doubles criterion is applied only to those doubles for which g ra > 0.98 
and > 0.99. The sample data analyses predict that by using these two metrics, as many as half 
of the dead-ending elements may be found by evaluating only two to five percent of the reduced 
matrix. 



15 Generally, as is more fully described below, single and double DEE, using either or both of original 
DEE and Goldstein DEE, is run iteratively until no further elimination is possible. Usually, 
convergence is not complete, and further elimination must occur to achieve convergence. This is 
generally done using "super residue" DEE. 
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In a preferred embodiment, additional DEE computation is done by the creation of "super residues" 
or "unification", as is generally described in Desmet , Nature 356:539-542 (1992); Desnet, et a/., 
T he Protein Folding Problem and Tertiary Structure Prediction . Ch. i0:i-49 (1994); Goldstein, et 
a/., supra. A super residue is a combination of two or more variable residue positions which is then 
5 treated as a single residue position. The super residue is then evaluated in singles DEE, and 
doubles DEE, with either other residue positions or super residues. The disadvantage of super 
residues is tnat tnere are many more rotamenc states which must be evaluated; that is, if a first 
variabie residue position has 5 possible rotamers, and a second variable residue position has 4 
possible rotamers, there are 20 possible super residue rotamers which must be evaluated. 
10 However, these super residues may be eliminated similar to singles, rather than being flagged like 
pairs. 



The selection of which positions to combine into super residues may be done in a variety of ways. 
In general, random selection of positions tor super residues results in inefficient elimination, but it 
can be done, although this is not preferred. In a preferred embodiment, the first evaluation is the 
1 5 selection of positions for a super residue is the number of rotamers at the position. If the position 
has too many rotamers, it is never unified into a super residue, as the computation becomes too 
unwieldy. Thus, only positions with fewer than about 100,000 rotamers are chosen, with less than 
about 50,000 being preferred and less than about 10,000 being especially preferred. 



In a preferred embodiment, the evaluation of whether to form a super residue is done as follows. 
20 Ail possible rotamer pairs are ranked using Equation 33, and the rotamer pair with the highest 
number is chosen for unification: 



Equation 33 

fraction of flagged pairs 
log(number of super rotamers resulting from the potential unification) 

25 Equation 33 is looking for the pair of positions that has the highest fraction or percentage of flagged 
pairs but the fewest number of super rotamers. That is, the pair that gives the highest value for 
Equation 33 is preferably chosen. Thus, if the pair of positions that has the highest number of 
flagged pairs but also a very large number of super rotamers (that is, the number of rotamers at 
position i times the number of rotamers at position j), this pair may not be chosen (although it 

30 could) over a lower percentage of flagged pairs but fewer super rotamers. 

In an alternate preferred embodiment, positions are chosen for super residues that have the 
highest average energy; that is, for positions i and j, the average energy of all rotamers for i and all 
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rotamers for j is calculated, and the pair with the highest average energy is chosen as a super 
residue- 
Super residues are made one at a time, preferably. After a super residue is chosen, the singles 
and doubles DEE computations are repeated where the super residue is treated as if it were a 
5 regular residue. As for singles and doubles DEE, the elimination of rotamers in the super residue 
DEE will alter the minimum energy calculations of DEE. Thus, repeating singles and/or doubles 
DEE can result in further elimination of rotamers. 

Figure 3 is a detailed illustration of the processing operations associated with a ranking module 34 
of the invention. The calculation and storage of the singles and doubles energies 70 is the first 

10 step, although these may be recalculated every time. Step 72 is the optional application of a cutoff, 
where singles or doubles energies that are too high are eliminated prior to further processing. 
Either or both of original singles DEE 74 or Goldstein singles DEE 76 may be done, with the 
elimination of original singles DEE 74 being generally preferred. Once the singles DEE is run, 
original doubles (78) and/or Goldstein doubles (80) DEE is run. Super residue DEE is then 

15 generally run, either original (82) or Goldstein (84) super residue DEE. This preferably results in 
convergence at a global optimum sequence. As is depicted in Figure 3, after any step any or all of 
the previous steps can be rerun, in any order. 

The addition of super residue DEE to the computational processing, with repetition of the previous 
DEE steps, generally results in convergence at the global optimum. Convergence to the global 
20 optimum is guaranteed if no cutoff applications are made, although generally a global optimum is 
achieved even with these steps. In a preferred embodiment, DEE is run until the global optimum 
sequence is found. That is, the set of optimized protein sequences contains a single member, the 
global optimum. 

In a preferred embodiment, the various DEE steps are run until a managable number of sequences 
25 is found, i.e. no further processing is required. These sequences represent a set of optimized 
protein sequences, and they can be evaluated as is more fully described below. Generally, for 
computational purposes, a manageable number of sequences depends on the length of the 
sequence, but generally ranges from about 1 to about 10 15 possible rotamer sequences. This 
range can be extended to approximately 10 3 ° if B&T is used as the next analyzing step. 

30 Alternatively, DEE is run to a point, resulting in a set of optimized sequences (in this context, a set 
of remainder sequences) and then further compututational processing of a different type may be 
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run. For example, in one emoodiment direct calculation of sequence energv as outlined above is 
done on the remainder possible sequences. Alternatively a Monte Carlo sea r ch can be run. In 
another embodiment. D&T can be run. 



In a preferred embodiment, the computation processing need not comprise a DEE computational 
5 step. In this embodiment, a Monie Carlo search is undertaken, as is known in the arc. See 
Metropolis or a/., J Chem. Phys. 21:1087 (1953), hereby incorporated oy reference. In this 
embodiment, a random sequence comprising random rotamers is chosen as a start point. In one 
embodiment, the variable residue positions are classified as core, boundaiy or surface residues 
and the set ot available rescues at each position is thus defined. Then a random sequence is 

10 generated, and a random roiarner for each annno acid is chosen. This serves as the starting 
sequence o\ the Monte Cano search. A Monte Carlo search then makes a random jump atone 
position, either to a different rotamer of the same amino acid or a rotamer of a different amino acid, 
2nd then a new sequence energy (C tota)sepoenc8 ) is calculated, and if the new sequence energy meets 
the Boltzmann criteria tor acceptance, it is used as the starting point for another jump. If the 

15 Boltzmann test fails, another random jump is attempted from the previous sequence. In this way, 
sequences with lower and lower energies are found, to generate a set of low energy sequences. 

If computational processing results in a single global optimum sequence, it is frequently preferred 
to generate additional sequences in the energy neighborhood of the global solution, which may be 
ranked. These additional sequences are also optimized protein sequences. The generation of 

20 additional optimized sequences is generally preferred so as to evaluate the differences between 

the theoretical and actual energies of a sequence. Generally, in s preferred embodiment, the set of 
sequences is at least about 75% homologous to each other, with at least about 80% homologous 
being preferred, at least about 85% homologous being particularly preferred, and at least about 
90% being especially preferred, in some cases, homology as high as 95% to 98% is desirable. 

25 Homology in this context means sequence similarity or identity, with identity being preferred. 
Identical in this context means identical amino acids at corresponding positions in the two 
sequences which are being compared. Homology in this context includes amino acids which are 
identical and those which are similar (functionally equivalent). This homology will be determined 
using standard techniques known in the art, such as the Best Fit sequence program described by 

30 Devereux, et a/., Nucl. Acid Res. 12:387-395 (1984), or the BLASTX program (Altschul, era/., J, 
MQl Bio! - 215:403-410 (1990)) preferably using the default settings for either. The alignment may 
include the introduction of gaps in the sequences to be aligned. In addition, for sequences which 
contain either more or fewer amino acids than an optimum sequence, ii is understood that the 
percentage of homology will be determined based on the number cf homologous amino acids in 

35 relation to the total number of amino acids. Thus, for example, homology of sequences shorter 
than an optimum will be determined using the number of amino acids in the shorter sequence. 
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Once optimized protein sequences are identified, the processing of Figure 2 optionally proceeds to 
step 56 which entails searching the protein sequences. This processing may be implemented with 
the search module 36. The search module 36 is a set of computer code that executes a search 
strategy. For example, the search module 36 may be written to execute a Monte Carlo search as 
5 described above. Starting with the global solution, random positions are changed to other rotamers 
allowed at the particular position, both rotamers from the same amino acid and rotamers from 
different amino acids. A new sequence energy (E tolalteqtjonc4 ,) is calculated, and if the new sequence 
energy meets the Boltzmann criteria for acceptance, it is used as the starting point for another 
jump. See Metropolis et a/., 1953, supra, hereby incorporated by reference. If the Boltzmann test 

10 fails, another random jump is attempted from the previous sequence. A list of the sequences and 
their energies is maintained during the search. After a predetermined number of jumps, the best 
scoring sequences may be output as a rank-ordered list Preferably, at least about 10 6 jumps are 
made, with at least about 10 7 jumps being preferred and at least about 10 8 jumps being particularly 
preferred Preferably, at least about 100 to 1000 sequences are saved, with at least about 10,000 

15 sequences being preferred and at least about 100,000 to 1 000,000 sequences being especially 
preferred. During the search, the temperature is preferably set to 1000 K. 

Once the Monte Carlo search is over, all of the saved sequences are quenched by changing the 
tagnperature to 0 K, and fixing the amino acid identity at each position. Preferably, every possible 
rotamer jump for that particular amino acid at every position is then tried. 

20 The computational processing results in a set of optimized protein sequences. These optimized 

protein sequences are generally, but not always, significantly different from the wild-type sequence 
from which the backbone was taken. That is, each optimized protein sequence preferably 
comprises at least about 5-10% variant amino acids from the starting or wild-type sequence, with at 
least about 15 -20% changes being preferred and at ieast about 30% changes being particularly 

25 preferred 

These sequences can be used in a number of ways. In a preferred embodiment, one, some or all 
of the optimized protein sequences are constructed into designed proteins, as show with step 58 of 
Figure 2. Thereafter, the protein sequences can be tested, as shown with step 60 of the Figure 2. 
Generally, this can be done in one of two ways. 

30 In a preferred embodiment, the designed proteins are chernicaiiy synthesized as is known in the 

art This is particularly useful when the designed proteins are short, preferably less than 150 amino 
acids in length, with less than 100 amino acids being preferred, and less than 50 amino acids being 
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particularly preferred, although as is known in the art. longer proteins can be made chemically or 
enzymatically. 

In a preferred embodiment, particularly for longer proteins or proteins for which ls r ge samples are 
desired, the optimized sequence is used to create a nucleic acid such as DNA which encodes the 
5 optimized sequence and which can then be cloned into a host cell and expressed. Thus, nucleic 
acids, and particularly DNA, can be made which encodes each optimized protein sequence. This is 
done using well known procedures. The choice of codons, suitable expression vectors and 
suitable host cells will vary depending on a number of factors, and can be easily optimized as 
needed. 

10 Once made-, the deigned proteins are experimentally evaluated and tested for structure, function 
and stability, as required. This wili be done; as is known In the art, and will depend in part on the 
original protein from which the protein backbone structure was taken. Preferably, the designed 
proteins are more stable than the known protein that was used as the starting point, although in 
some cases, if some constaints are placed on the methods, the designed protein may be less 

1 5 stable. Thus, for example, it is possible to fix certain residues for altered biological activity and find 
the most stable sequence, but it may still be less stable than the wild type protein. Stable in this 
context means that the new protein retains either biological activity or conformation past the point 
at which the parent molecule did. Stability includes, but is not limited to, thermal stability, i.e. an 
increase in the temperature at which reversible or irreversible denaturing starts to occur; proteolytic 

20 stability, i.e. a decrease in the amount of protein which is irreversibly cleaved in the presence of a 
particular protease (including autolysis); stability to alterations in pH or oxidative conditions; 
chelator stability; stability to metal ions; stability to solvents such as organic solvents, surfactants, 
formulation chemicals; etc. 

In a preferred embodiment, the modelled proteins are at least about 5% more stable than the 
25 original protein, with at least about 10% being preferred and at least about 20-50% being especially 
preferred. 

The results of the testing operations may be computationally assessed, as shewn with step 62 of 
Figure 2. An assessment module 38 may be used in this operation. That is, computer code may 
be prepared to analyze the test data with respect to any number of rnetrices. 

30 At this processing juncture, if the protein is selected (the yes branch at block 64) then the protein is 
utilized (step 66), as discussed below. If a protein is not selected, the accumulated information 
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may be used to alter the ranking module 34, and/or step 56 is repeated and more sequences are 
searched. 

In a preferred embodiment, the experimental results are used for design feedback and design 
optimization. 

5 Once made, the proteins of the invention find use in a wide variety of applications, as will be 

appreciated by those in the art, ranging from industrial to pharmocologicai uses, depending on the 
protein. Thus, for example, proteins and enzymes exhibiting increased thermal stability may be 
used in industrial processes that are frequently run at elevated temperatures, for example 
carbohydrate processing (including saccharification and liquifaction of starch to produce high 
10 fructose corn syrup and other sweetners), protein processing (for example the use of proteases in 
laundry detergents, food processing, feed stock processing, baking, etc.), etc Similarly, the 
methods of the present invention allow the generation of useful pharmaceutical proteins, such as 
analogs of known proteinaceous drugs which are more thermostable, less proteolyticaily sensitive, 
or contain other desirable changes. 

15 The following examples serve to more fully describe the manner of using the above-described 
invention, as well as to set forth the best modes contemplated for carrying out various aspects of 
the invention. It is understood that these examples in no way serve to limit the true scope of this 
invention, but rather are presented for illustrative purposes. All references cited herein are 
explicitly incorporated by reference in their entirety. 

20 EXAMPLES 

Example 1 

The Generation and Use of B&T 

We tested the generality of the algorithm by applying it to a suite of optimization problems 
representative of different protein structural classes. Rotamers were selected from a backbone 
25 dependent library (Dunbrack, R.L. & Karplus, M., J. Moi. Biol. 230, 543-574 (1993)). To test a- 

helical surface positions, the 12 residues occupying the b, c, and f locations in the heptad repeat of 
one helix of the coi!ed-cold GCN4-p1 dimer (E.K. O'Shea, et al„ Science. 254, 539-544 (1991)) 
were optimized from the set of rotamers corresponding to hydrophilic amino acids (A, D, E, H, K, N, 
Q, R, S, and T). There were 9.1 * 10 22 rotameric combinations. 
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The (31 domain cf streptococcal proton G (Gallagher. T et al., Biochemistr/. ?>2 4721-4728 
(1994)) was used fcr the remaining cnr.es. As ?. representative o* core and toundary optimization 
problems, a subset of positions determined to be in the core and boundary according to our residue 
classification scheme (positions 3, 5, 7, 12, 23, 25, 26, 30 t 34, 43, 45, 52, 54) were optimized from 
5 the 3.4 * 10 25 combinations of hydrophobic rotamers (amino acids A, F, I, L, M, V, W, and Y). For 
P-sheet surfaces, a subset of the P-sheet surface residues (positions 4, 6, 15, 17, 42, 44, 53, 55) 
were optimized from the 4.9 * 10 17 combinations of hydrophiiic rotamers. 

To represent problems consisting of a mixture of different structural types, including turns, we also 
included the optimization of the residues containing any atoms within 10 A of the side-chain atoms 
10 of Val 21 . Of these fourteen, the core residues (positions 3, 20, 36) were allowed to have any of 
the hydrophobic identities, the surface residues (positions 2, 19, 21, 22, 24) had hydrophiiic 
identities, and the remaining boundary residues (positions 1,18, 23, 25, 27, 29) were selected from 
a group of hydrophiiic and hydrophobic residues, excluding methionine (amino acids A, D, E, F, H, 
I, K, L, N, Q, R, S, T, V, W, and Y). Tnere were 1.3 * 1G 29 possible rotamenc combinations. 

15 The most difficult benchmark consisted of all 18 non-glycine core and boundary residues 

(Malakaukas S. & Mayo, S.L, Nsst. Struc. 3io!. 5, 470-475 (1998)). The core residues (positions 3, 
5, 7, 20, 26, 30, 34, 39, 52, 54) v«/ere selected from the se; of hydrophobic amino acids, and the 
boundary residues (positions 1,12, 23, 33, 37, 45 50, 56) were selected from the composite list of 
hydrophiiic and hydrophobic residues. There were 1.S > ID'-* 4 possible rotamenc combinations. 



20 Energy Expression 

We employ an energy expression that consists of van der Waals, electrostatic, and solvation terms. 
For van der Waals, a Lennard-Jones 6-12 potential was used, with radii scaled by a factor of 0.9 
(Dahiyat B.I. & Mayo, S.L., Proc. Nail. Acaa. Sci. USA. 94, 10172-10177 (1997)). Electrostatics were 
computedusing a distancedependent dielectric and a hybridization-dependent hydrogen-bonding term 
25 (Dahiyat, B.I. , eta!., Protein Science. 6, 1333-1337(1997)). Solvation effects were approximated from 
hydrophotr- surface area burial (Street, A.G. & Mayo, S.L, Folding & Design. 3, 253-258 (1998)). 
Atom radii and hydrogen-bond well depths were based on the DREIDING force-field (Mayo, S.L, et 
al., J. Phys. Chem. 94, 8897-8909 (1990)). 

Calculation 

30 For reference, calculation times were recorded using a fully optimized DEE algorithm incorporating 
high energy threshold reduction (HETR) (De Maeyer, M., et al., Folding & Design, 2, 53-56 (1997)) 
and magic bullets and other doubles optimizations (Gordon, D.B. & Mayo, S.L., J. Comp. Chem. 19, 
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1505-1514. (1998)). Calculations were also performed using an enhanced B&B implementation that 
employed the efficient bounding criteria and termination preprocessing. 

For the first three benchmark cases, all calculations were performed using an initial upper bound of 
0.0 kcal/mol, since our energy expression typically results in optimal sequences with negative 
energies. For the remaining two cases, initial bounds were obtained by first running the approximate 
version of the algorithm, in which the rotamer lists were truncated to the fifteen rotamers with the 
lowest bounding energies at each residue position. These provided initial bounds of -153.0 and 
-250.0 kcal/mol, respectively. 



The generality of the sorting criteria was demonstrated by performing optimizations with values of fin 
0 Equation 5 ranging from 0 to 1 . 



To illustrate the reliability of the approximate form of the algorithm, optimizations were also performed 
using only the top 30 rotamers at each position as ranked after a single round of termination. 

The larger benchmark problem consisting of core and boundary residues was used to demonstrate 
how DEE and B&T methods can work in concert. The problem was optimized using a DEE algorithm, 
and upon every reduction of comple<ity by at least and order of magnitude, the state of the diminished 
problem was recorded. A B&T algorithm was used to complete the calculation for each reduced state. 
The calculations were performed using the optimal sorting factor as determined from the previous 
benchmarks. 



For all calculations, the total CPU time was recorded as well as the portions of that time spent 
performing termination preprocessing and the actual recursive search. The total number of nodes 
comprising the final pruned tree was also recorded by tallying the number of nodes remaining after 
termination at every level of recursion. Calculations were performed on a single R10000 CPU of a 
Silicon Graphics Origin 2000. 



Pairwise Bounding Expression 

This section describes the construction of a stringent expression for a lower bound for a system 
composed only of one and two-body interactions in terms of both a partially specified sequence and 
the set of rotamers available at its unspecified positions. 
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For a system consisting only of two-body interactions, the total potential energy can be expressed as 
the sum of energies between all pairs 



5 In a protein, ; and; refer to ammo-acid positions, and £(i,j) is energy of interaction between amino- 
acids at those positions. 

A protein system also consists of single-body interactions. Because each body is an amino-acid side 
chain at a particular position on the protein backbone, tnere is an energy contribution both from side 
chain interactions with other side chains as well as interactions with the protein template scaffolding. 
1 0 Both energies of interaction depend on the side chain position, amino acid identity, and configuration. 
Thus the total potential energy can be expressed, 



where, and c is a position-specif c index describing a side chain rotamer of a particular amino acid type 
15 and configuration. 

For the purposes of deriving an expression tor a lower bound, it is desirable to alter the indices to allow 



Equation 34 




j 



Equation 35 



£,c,i = Z ' opiate) + Z Z E( ^c , j c ) 




redundancy. 



Equation 36 



£ Iotat = Z E (*c > template) + \ 



EZ*(Wc) 



* j 



20 



To ensure that the bounding expression satisfies the condition in Equation 3, we use the following 
inequalities (Equations 37 and 38): 



Equation 37 



m in [£(/ r , template)] < £(/ , template) 

r " 



Equation 38 



min lE(i , j ) ) <E(1 , j ) j 

r r g a a 



49 



BNSDOCID <WO 0116862A2 I > 



WO 01/16862 



PCT/US00/4O805 



in which the indices rand s refer to all of the possible rotamers available at each position, and the 
minimum operator selects the single rotamerthat minimizes the sub-expression. The index g denotes 
the rotamer found at the specified position in the giobai minimum combination. A simple expression 
for the lower bound is therefore obtained by summing minimal interaction energies between positions 
by discovering minimal rotamer-pairs. 

Equation 39 

E£L = Z min[E{i r , template)] + ^ £ min V m jn[£(/ r , j s )] 



The derivation above represents a generic strategy for producing a bounding expression from any total 
energy expression. For example, more restrictive bounds can be obtained from energy expressions 
10 that sum over three or four-body interactions. However, the computational cost to implement such 
bounds on a protein system is very high. Fortunately, there are variations of Equation 35 that are 
equivalent in terms of computational cost yet yield better bounds. 

An alternative way to express the total energy of the system is to distribute the template energies into 
the pair calculation. Given an energy quantity for a pair of rotamers, 

15 Equation 40 

, template) + £(y c , template) | E(i c J c ) 

pair ' Jc ) 2^ 2 2 

in which p is the number of amino acid positions, the total energy can be expressed, 

Equation 41 

^loial = ^ ^ & pair 0'c » Jc ) 
' J 

20 which, in turn, can be used to produce the following bounding expression as shown in Equation 42, 

Equation 42 



E * OUftd =I,min V min [E (i ,j )] 

w i r"j,j*l s pair r s 



Because the minima must be evaluated with respect to single-body and pair energies 
25 simultaneously, this bounding expression is necessarily greater than or equal to the expression in 
Equation 11. Therefore the new bound is more restrictive. The computational requirements for 
both expressions, however, are of the same order. Each requires n 2 p 2 calculations, where n is the 
average number of available rotamers per position, and p is the number of positions. 
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One can derive a tower bound that is even more restrictive by performing an expansion of Equation 
41 before applying the minimization operators. When testing a particular node during traversal of 

node have uniquely specified rotamers, whereas the remaining, deeper nodes are not yet uniquely 
specified. The set of all amino acid positions can therefore be decomposed into two subsets, fixed 
(F) and variable (V). Equation 41 can be rewritten as Equation 43, 



Equation 43 

^lotal = Z Z ^pair ( 7 "c » Jc ) 



Next, we expand the summation to give Equation 44: 
10 Equation 44 

E ^ = Z Z £ P- ('c Jc) + ZI ^pa.r (/, Jc)+SZ ^pa.r ('c > A > + Z Z £ P- ('c > >c ) 

je/-' ye^ /e^ yeA ief ye*' 



Application of the minimum operators to this expression would yield a bounding expression 
equivalent to Equation 42. To increase the stringency, we utilize the inequality, 

Equation 45 

™ n Z E P*r Or J* ) * Z E p*r Or J, ) 

15 ; ; r 

The middle two terms of Equation 44 differ only in their indices, and are therefore equivalent to one 
another. However, there is a difference once the minimum operators are applied, since the rotamers 
of the fixed subset (F) will restrict the selection of :he minimum energy rotamer pair for the minimized 
third term, but not for the second. Therefore, we reverse the order of the summation for the second 
20 term and combine it with the third term to make use of (Equation 45) such that the minimum will be as 
large as possible, 

Equation 46 

E u*m\ = Z Z £ P"r 0c > Jc ) + 2Z Z £ P»r 0 C > A ) + Z Z E P™ 0 c > Jc ) 

t€F ye/- ;eK jzF i t -(' j*V 

J*' j*t 

Now we apply the minimum operators to all sums over positions that are not uniquelv specified. 
25 Equation 47 
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/£L = Z Z £ - ^ • ^ > H 2 Z <™ Z E ^ <'V • A ) + Z ™"Z min £ pair (/ r , J s ) 

/eF jef- teV yeF '€K /e»' 



)* 



To achieve further stringency, we rearrange Equation 23 before applying the minimum operators. 



Equation 48 



£,o,a. = Z Z £ P a.r ('c - A ) + Z i 2 Z ^P- ('"c > 7c ) + Z £ P- 0*c » A ) 



5 From which we obtain, Equation 49 



Equation 49 



j* 



2]T E fMT (/ r , y, ) + £ min[£ pair (i r , y, )] 

yeF y«^ 



The expression is generalizable to any system consisting only of two-body interactions such that the 
total energy of the system can be expressed as in Equation 41 . 

1 0 Efficient Implementation of Bounding Expression 

The computational cost of evaluating Equation 49 is proportional to p 2 n 2 t where p is the number of 
positions and n is the average number of rotamers at each position. When performing termination, the 
bound is evaluated for all pn rotamers, so that the total calculation order for a round of termination is 

15 Termination consists of evaluating the bounding expression for rotamers at all the unspecified 
positions. Therefore, a position is temporarily considered a member of set F while its rotamers are 
being evaluated. Since the expensive second term of the final summation is dependent only on V, its 
possible values may be precomputed for all rotamers i r once per position and placed into a table for 
lookup during the evaluation of Equation 49. 
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The cost of performing p 2 n 2 calculations to assembling the table for the termination of all o positions 
scales as p 3 n 2 The bounding expression now only requires order pn calculations for each of the pn 
times it Is performed, for an overall order of p 2 n 2 . The overall calculation time therefore scales 
approximately as p 3 n 2 , which is nearly n times faster than the direct implementation. Since n is often 
5 as large as 100-200, the speed increase can be drastic. 



Results 

To assess the generality of the B&T approach, different incarnations of the algorithm were applied to 
benchmark problems representing different structural classes, as described in Materials and methods. 
Optimization times were heavily dependent on the sorting heuristic, as shown in Figure 8. The 

1 0 performance improvement, as measured by dividing the total optimization times, ranged from a factor 
of three for the p-sheet case to a factor of over fortv for the "mixed" case. Remarkably, very similar 
values cf the sorting factor f produced the fastest optimization times for alt structural classes. Initially, 
values at intervals cf 0.1 were tested, but since sfl benchmark cases exhibited minima nearf = 0.1, 
values at intervals of 0 01 v/ere sampled nesrth's value. At this level of refinement, the different cases 

15 had different optimal sorting factor values, but a value of f = 0.08 was close to optimal for all of them. 
We also observe that optimizations with the fastest times had the fewest nodes in their pruned 
combinatorial trees. 



The total calculation times for the benchmarks using a sorting factor of 0. 08 are competitive compared 
to tines of a highly optimized DEE: algorithm, and are significantly faster than the optimized B&B 
20 search (Figure 9). For the P-sheet surface and the small ccre-bour.ds.ry calculations, the B&T method 
is approximately twenty times faster than DEE. For the mixed case, it is nearly eight times faster. For 
the a-heltcal case, however, the B&T method is more than two times slower. This is likely a reflection 
of the linear topological arrangement of the system, in which it difficult to select positions to place at 
the tree root that both restrict large parts of the system and are themselves restricted. 



25 The approximate form of the algorithm proved to be exceptionally effective. F:x the four cases above, 
B&T calculations that used only the 30 top-ranked rotamers at each position all took less than fifteen 
seconds and produced the correct GMEC solutions. For the more difficult cere-boundary case, the 
calculation took five minutes, and also produced the correct GMEC solution. For this case, a more 
aggressive calculation using only the top 1 5 rotamers at each position took 25 seconds and produced 

30 a solution whose energy was in error by less than 1 %. This energy was used as the initial bound for 
the remaining calculations on the system. 



53 



BNSDOCID <WO 0116862A2 . > 



WO 01/16862 



PCT/US00/40805 



To illustrate the potential for combining DEE and B&T methods by way of DEE preprocessing, we 
selected a problem computable by either algorithm to enable us to perform quantitative comparisons. 
In practice, however, the technique is applied to problems that are not currently computable in 
reasonable computer time by either algorithm, for which the benefit is obviously much greater. Figure 
5 10 illustrates the total calculation times partitioned into DEE and B&T times for optimization of the 
difficult benchmark consisting of core and boundary residues. The calculations differ in the amount 
of time allotted to DEE reduction before completion with the B&T algorithm. At the best timing, the 
combined algorithms complete the optimization eight times faster than DEE alone. Moreover, we have 
observed that, in practice, the B&T method is generally effective on completing large problems that 
10 DEE can reduce to as high as 10 3 ° remaining sequences. 

Discussion 

We have described a deterministic search method forrotamer optimization, and have demonstrated 
that it is as fast as the current standard algorithm for rotamer optimization in protein design, and in 
some cases, it is much faster. The success of the Branch-and-Terminate method rests on the 
15 construction of a novel pairwise bounding expression, which is used both to perform termination and 
to supply energetic information witn which to determine the search order. Although the algorithm is 
tailored to protein systems, it is generalizable to any problem that can be similarly described. 

Although the B&T algorithm is quite effective when used alone, it is perhaps more important that it 
i ncrea ^ es the problem size limit of DEE calculations by providing an efficient way to complete 
20 optimizations for which elimination criteria have become less effective at removing rotamers. This 
makes it possible to perform optimizations on larger proteins and on systems with large numbers of 
interacting residues. 

The size limit may be raised even higher once the limitations of the approximate form of the algorithm 
become better understood. For the benchmark cases, the approximate algorithm found the GMEC 

25 solutions up to a thousand times faster than either of the exact methods. Even the DEE 
implementation to which the B&T method is compared incorporates some conservative approximations 
in the form of high energy threshold rejection (HETR) criteria (De Maeyer, M., et al., Folding & Design, 
2, 53-56 (1997)). Analogous techniques may provide a way to construct a faster, approximate B&T 
algorithm with a clearly defined accuracy. Along the same line of reasoning, truncation based on 

30 bounding energies might be an effective replacement for HETR cutoffs in DEE. 

There is also room for improvement in the heuristic for determining search order. Heuristics that are 
even more effective may exist that make use of structural information, in addition to energetics and 
size considerations. 
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In addition, we are currently exploring features of the B&T algorithm that are common to all backtrack 
searches. First, it is possible to exhaustively sample the amino acid and rotamer sequence space 
near the GMEC. This is accomplished by modifying the algorithm so that it refrains from lowering the 
initial minimum energy upon finding low energy combinations (Leach, A.R. & Lemon, A.P., Proteins. 
5 33, 227-239 (1998)) The result is a full enumerator, of all sequences wrh energies below the 
specified initial minimum energy, provided that this energy >s c'ose enough to the GMEC energy that 
the calculation remains tractable 

Also, i! »s straightforward to paralleliz.* backtrack algorithms by dispatching branches to different 
CPU's. We observe a scaling efficiency between 60%-80%, depending on the type of problem. 
10 Another advantage of the tree representation is that it makes it possible to estimate how much time 
the optimization wil! require. This is accomplished using a well-known tree estimation technique (Hall, 
M. & Knuth, D.E, Amer. Math. Monthly. 72, 21-28 (1995)) in which statistics are compiled for 
randomsampie trajectories through the tree. This has helped us to predict when it is best to transfer 
DEE problems to B&T for completion. 

15 In practice, we believe that the best way to use the B&T method is to first attempt to optimize a 
problem using DEE. Upon observing that DEE begins to produce very few eliminations or dead-ending 
pairs, the state information should be transferred to an approximate form of the B&T algorithm. Using 
the energy from this calculation as the initial upper bound, the approximate algorithm may be repeated 
again with successively more conservative truncations. The final energy should then be used as the 

20 initial bound for the exact B&T calculation. 
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CLAIMS 

We claim: 

1. A method executed by a computer under the control of a program, said computer including a 
memory for storing said program, said method comprising the steps of: 

5 (A) receiving a protein backbone structure with variable residue positions; 

(B) establishing a group of potential rotamers for each of said variable residue positions, 
wherein at least one variable residue position has rotamers from at least two different amino 
acid side chains; and 

(C) analyzing the interaction of each of said rotamers with all or part of the remainder of said 
0 protein backbone structure to generate a set of optimized protein sequences, wherein said 

analyzing step includes a Branched and Terminate (B&T) computation. 

2. A method according to claim 1 wherein said analyzing step further comprises a DEE computation. 



3. A method according to claim 1 or 2 wherein said set of optimized protein sequences comprises the 
globally optimal protein sequence. 

15 4. A method according to claim 2 wherein said DEE computation is selected from the group consisting 
of original DEE and Goldstein DEE. 

5. A method according to claim 1 or 2 wherein said analyzing step includes the use of at least one 
scoring function. 

6. A method according to claim 5 wherein said scoring function is selected from the group consisting 
20 of a Van der Waals potential scoring function, a hydrogen bond potential scoring function, an atomic 

solvation scoring function, an electrostatic scoring function and a secondary structure propensity 
scoring function. 

7. A method according to claim 6 wherein said analyzing step includes the use of at least two scoring 
functions. 



56 



BNSDOCID: <WO 0116862A2 I > 



WO 01/1 6862 PCT/US00/40805 

8 A method according to claim 6 wherein said analyzing step includes the use of at least three 
scoring functions. 

9. A method according to claim 6 wherein said analyzing step includes the use of at least four scoring 
functions. 

5 10. A method according to claim 6 wherein said atomic solvation scoring function includes a scaling 
factor that compensates for over-counting. 

1 1 . A method according to claim 1 or 2 further comprising testing at least one member of said set to 
produce experimental results 

12. A method according to claim 3 further comprising 

10 (D) generating a rank ordered frst of additional optimal sequences from said globally optimal 

protein sequence. 

13. A method according to claim 12 wherein said generating includes the use of a Monte Carlo 
search. 

14. A method according to claim 1 wherein said analyzing step step comprises a Monte Carlo 
15 computation. 

15. A method accordmg to any of claims 1-14 further comprising: 

(E) testing some or all of said protein sequences from said ordered list to produce potential 
energy test results. 

16. A method according to claim 15 further composing: 

20 ( F ) analyzing the correspondence between said potential energy test results and theoretical 

potential energy data. 

17. A method according to claim 1 further comprising altering at least one supersecondary structure 
parameter value of said protein backbone structure prior to establishing said potential rotamer group. 

18. An optimized protein sequence generated by the method of claim 1. 
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19. A nucleic acid sequence encoding a protein sequence according to claim 18. 



20. An expression vector comprising the nucleic acid of claim 19. 

21. A host cell comprising the nucleic acid of claim 19. 

22. A computer readable memory to direct a computer to function in a specified manner, comprising: 

5 a side chain module to correlate a group of potential rotamers for residue positions of a protein 

backbone model; 

a ranking module to analyze the interaction of each of said rotamers with all or part of the 
remainder of said protein to generate a set of optimized protein sequences wherein said analysis 
includes a B&T computation. 

10 23. A computer readable memory according to claim 22 wherein said ranking module includes a van 
der Waals scoring function component. 

24. A computer readable memory according to claim 22 wherein said ranking module includes an 
atomic solvation scoring function component. 

25. A computer readable memory according to claim 22 wherein said ranking module includes a 
15 hydrogen bond scoring function component. 

26. A computer readable memory according to claim 22 wherein said ranking module includes a 
secondary structure scoring function component. 

27. A computer readable memory according to claim 22 further comprising 

an assessment module to assess the correspondence between potential energy test results 
20 and theoretical potential energy data. 
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BENCHMARK TIMES 
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Total Times [min] 


small C-B a 


cc-surface 


(3- surface 


Mixture Core-Boundary 


DEE b 


177.4 


2.2 


40.5 


101.6 


1154.0 


B&B C 


70.7 


294.8 


44.4 


544.0 


>30000 d 


B&T e 
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6.1 
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B&T Component Times [min] 










Preprocessing 


0.1 


0.1 


0.1 
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Search 
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6.0 


2.0 


12.4 


744.8 


Approximation 1 








0.2 


0.4 


B&T Totai Nodes 


3829 


1697 


1546 


845 


34634 



^Refers to the benchmark comprised of a small set of core and boundary positions. 

b DEE was performed using the speed enhancements in Goldstein, R.R, Biophys. J., 
j 66:1335-1340 (1994) and Gordon & Mayo, J. Comp. Chern., 19:1505-1514 (1998). 
| °The B&B algorithm uses the novel bounding expression and includes termination 
i preprocessing. 

| d For the difficult Core-Boundary case, the incomplete B&B optimization was aborted 
I after 30.000 minutes. 

i e Total B&T time is computed as the sum of the approximation, preprocessing, and 
j search times. 

| 'An approximate B&T algorithm was used to obtain initial bounds for the Mixture 
and difficult Core-Boundary cases. These calculations used only the top thirty rotamers 
at each position according to their bounding energy. 
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